vic
89 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c2a170c293 |
FROMGIT: cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
Step #5 moves the group leader into b, and step #6 then moves all threads of
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #5, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to add A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. It then notices that A wants to migrate to A. As this is an identity
migration, it culls A by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
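The sequence above can be replayed in user space. What follows is a minimal sketch, not the kernel code: `struct cset`, `dst_holds_a()` and the tiny list helpers are hypothetical stand-ins, and reference counting is left out entirely. It only models which list each node ends up on when the dst linkage shares a node with the src linkage versus when it has its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Hypothetical, stripped-down cset: only the two preload nodes. */
struct cset {
	struct list_head mg_src_preload_node;
	struct list_head mg_dst_preload_node;
};

/*
 * Replays steps 1-6 for the real migration B -> A plus the identity
 * migration A -> A. With shared_node set, the dst linkage reuses the src
 * node (the old behaviour). Returns true if A ends up on the dst list.
 */
bool dst_holds_a(bool shared_node)
{
	struct list_head src, dst;
	struct cset a, b;
	struct list_head *a_dst_node;

	list_init(&src);
	list_init(&dst);
	list_init(&a.mg_src_preload_node);
	list_init(&a.mg_dst_preload_node);
	list_init(&b.mg_src_preload_node);
	list_init(&b.mg_dst_preload_node);
	a_dst_node = shared_node ? &a.mg_src_preload_node
				 : &a.mg_dst_preload_node;

	/* Steps 1-2: both B and A are put on the src list. */
	list_add_tail(&b.mg_src_preload_node, &src);
	list_add_tail(&a.mg_src_preload_node, &src);

	/* Step 4: B -> A; add A to dst only if its dst node is not busy. */
	if (list_empty(a_dst_node))
		list_add_tail(a_dst_node, &dst);

	/* Step 5: A -> A is an identity migration; cull A from src. */
	list_del_init(&a.mg_src_preload_node);
	assert(list_empty(&a.mg_src_preload_node));

	/* Step 6: is A still held through the dst list? */
	return !list_empty(&dst);
}
```

Under these assumptions, `dst_holds_a(true)` reports that A is missing from the dst list, while `dst_holds_a(false)` reports that A stays linked, which is the effect of splitting the node in two.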
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes:
|
||
|
|
f7c2472acb |
BACKPORT: cgroup: make per-cgroup pressure stall tracking configurable
PSI accounts stalls for each cgroup separately and aggregates them at each
level of the hierarchy. This causes additional overhead, with psi_avgs_work
being called for each cgroup in the hierarchy. psi_avgs_work has been
highly optimized; however, on systems with a large number of cgroups the
overhead becomes noticeable.
Systems which use PSI only at the system level could avoid this overhead
if PSI can be configured to skip per-cgroup stall accounting.
Add "cgroup_disable=pressure" kernel command-line option to allow
requesting system-wide only pressure stall accounting. When set, it
keeps system-wide accounting under /proc/pressure/ but skips accounting
for individual cgroups and does not expose PSI nodes in cgroup hierarchy.
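A user-space tool may want to know whether the running kernel was booted with this option before it looks for per-cgroup PSI files. The following is a rough sketch under stated assumptions: the helper name `cmdline_disables_cgroup_pressure()` is made up, the input is assumed to be the text of the kernel command line (for example read from /proc/cmdline), and the value of cgroup_disable= is assumed to be a comma-separated list. The authoritative parsing happens inside the kernel and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if the given kernel command line carries a cgroup_disable=
 * parameter whose comma-separated value list contains "pressure".
 */
bool cmdline_disables_cgroup_pressure(const char *cmdline)
{
	static const char key[] = "cgroup_disable=";
	static const char want[] = "pressure";
	const char *p = cmdline;

	assert(cmdline != NULL);

	while ((p = strstr(p, key)) != NULL) {
		/* Only accept the key at the start of a parameter. */
		bool at_start = (p == cmdline) || p[-1] == ' ' || p[-1] == '\n';

		p += sizeof(key) - 1;
		if (!at_start)
			continue;

		/* Walk the comma-separated values up to the next whitespace. */
		while (*p && *p != ' ' && *p != '\n') {
			size_t len = strcspn(p, ", \n");

			if (len == sizeof(want) - 1 && memcmp(p, want, len) == 0)
				return true;
			p += len;
			if (*p == ',')
				p++;
		}
	}
	return false;
}
```

For instance, a command line such as `quiet cgroup_disable=memory,pressure ro` is reported as disabling per-cgroup pressure accounting, while `cgroup_disable=memory` is not.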
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/patchwork/patch/1435705
(cherry picked from commit 3958e2d0c34e18c41b60dc01832bd670a59ef70f
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git tj)
Conflicts:
include/linux/cgroup-defs.h
kernel/cgroup/cgroup.c
1. Trivial merge conflict in cgroup-defs.h due to missing CFTYPE_DEBUG
2. Changed flags to (CFTYPE_NOT_ON_ROOT | CFTYPE_PRESSURE) in cgroup.c
because in 4.19 psi files were allowed only in non-root cgroups.
Bug: 178872719
Bug: 191734423
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ifc8fbc52f9a1131d7c2668edbb44c525c76c3360
Git-commit:
|
||
|
|
5318f3163c |
BACKPORT: cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen task to another cgroup.
This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
mostly tried to imitate the system-wide freezer. While uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not an acceptable state for some subset of the system.
Cgroup v2 freezer does not support freezing kthreads.
If a non-root cgroup contains a kthread, the cgroup can still be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform to the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes to the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
regardless of whether some cgroups are frozen.
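Because the interface is asynchronous, a caller that writes to cgroup.freeze has to read cgroup.events afterwards to learn the actual state. Below is a small sketch of the parsing side only. It assumes cgroup.events holds one "key value" pair per line (for example "populated 1" and "frozen 0"), and `cgroup_events_frozen()` is a hypothetical helper name, not part of any kernel or library API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parses the text of a cgroup.events file ("key value" pairs, one per line)
 * and returns the value of the "frozen" field, or -1 if it is absent.
 */
int cgroup_events_frozen(const char *events)
{
	const char *line = events;

	assert(events != NULL);

	while (*line) {
		char key[32];
		int value;

		if (sscanf(line, "%31s %d", key, &value) == 2 &&
		    strcmp(key, "frozen") == 0)
			return value;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;
	}
	return -1;
}
```

A caller would typically write "1" to the cgroup's cgroup.freeze file, then re-read cgroup.events and pass its contents to this helper until it returns 1.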
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
Change-Id: I3404119678cbcd7410aa56e9334055cee79d02fa
(cherry picked from commit 76f969e8948d82e78e1bc4beb6b9465908e74873)
cgroup-defs.h: use the struct cgroup_freezer_state and the
freezer field from definitions in I6221a975c04f06249a4f8d693852776ae08a8d8e
sched.h: use the frozen field defined in
I6221a975c04f06249a4f8d693852776ae08a8d8e
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
|
||
|
|
489d25a567 |
ANDROID: cgroups: add v2 freezer ABI changes
introduce the freezer_state struct to be used by the v2 freezer backports
Test: built and booted
Bug: 163547360
Signed-off-by: Marco Ballesio <balejs@google.com>
Change-Id: I4705ee9787f35db58e27de339d5264d1cd45007a
|
||
|
|
0e5ea532a0 |
ANDROID: cgroups: ABI padding
ABI padding in struct cgroup
Bug: 163547360
Test: built and booted the kernel
Change-Id: Ie6ef8bdc4a62f57039d3b456cf125db4582b255a
Signed-off-by: Marco Ballesio <balejs@google.com>
|
||
|
|
83f60b3043 |
ANDROID: GKI: preserve ABI for struct sock_cgroup_data
In commit ad0f75e5f57c ("cgroup: fix cgroup_sk_alloc() for
sk_clone_lock()") the struct sock_cgroup_data fields are changed a bit,
in a way that keeps the same size and functionality, it just packs
another bit into the structure.
Because this does not really change the abi, tell the genksyms detector
that nothing has changed so that the ABI checker is happy.
Bug: 161946584
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ibc748616140ac0da69b04699cfd2322dc4e5d1f4
Signed-off-by: Will McVicker <willmcvicker@google.com>
|
||
|
|
b41585fc93 |
Merge 4.19.134 into android-4.19-stable
Changes in 4.19.134
perf: Make perf able to build with latest libbfd
net: rmnet: fix lower interface leak
genetlink: remove genl_bind
ipv4: fill fl4_icmp_{type,code} in ping_v4_sendmsg
l2tp: remove skb_dst_set() from l2tp_xmit_skb()
llc: make sure applications use ARPHRD_ETHER
net: Added pointer check for dst->ops->neigh_lookup in dst_neigh_lookup_skb
net_sched: fix a memory leak in atm_tc_init()
net: usb: qmi_wwan: add support for Quectel EG95 LTE modem
tcp: fix SO_RCVLOWAT possible hangs under high mem pressure
tcp: make sure listeners don't initialize congestion-control state
tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()
tcp: md5: do not send silly options in SYNCOOKIES
tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers
tcp: md5: allow changing MD5 keys in all socket states
cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
cgroup: Fix sock_cgroup_data on big-endian.
sched: consistently handle layer3 header accesses in the presence of VLANs
vlan: consolidate VLAN parsing code and limit max parsing depth
drm/msm: fix potential memleak in error branch
drm/exynos: fix ref count leak in mic_pre_enable
m68k: nommu: register start of the memory with memblock
m68k: mm: fix node memblock init
arm64/alternatives: use subsections for replacement sequences
tpm_tis: extra chip->ops check on error path in tpm_tis_core_init
gfs2: read-only mounts should grab the sd_freeze_gl glock
i2c: eg20t: Load module automatically if ID matches
arm64/alternatives: don't patch up internal branches
iio:magnetometer:ak8974: Fix alignment and data leak issues
iio:humidity:hdc100x Fix alignment and data leak issues
iio: magnetometer: ak8974: Fix runtime PM imbalance on error
iio: mma8452: Add missed iio_device_unregister() call in mma8452_probe()
iio: pressure: zpa2326: handle pm_runtime_get_sync failure
iio:humidity:hts221 Fix alignment and data leak issues
iio:pressure:ms5611 Fix buffer element alignment
iio:health:afe4403 Fix timestamp alignment and prevent data leak.
spi: fix initial SPI_SR value in spi-fsl-dspi
spi: spi-fsl-dspi: Fix lockup if device is shutdown during SPI transfer
net: dsa: bcm_sf2: Fix node reference count
of: of_mdio: Correct loop scanning logic
Revert "usb/ohci-platform: Fix a warning when hibernating"
Revert "usb/xhci-plat: Set PM runtime as active on resume"
Revert "usb/ehci-platform: Set PM runtime as active on resume"
net: sfp: add support for module quirks
net: sfp: add some quirks for GPON modules
HID: quirks: Remove ITE 8595 entry from hid_have_special_driver
ARM: at91: pm: add quirk for sam9x60's ulp1
scsi: sr: remove references to BLK_DEV_SR_VENDOR, leave it enabled
ALSA: usb-audio: Create a registration quirk for Kingston HyperX Amp (0951:16d8)
doc: dt: bindings: usb: dwc3: Update entries for disabling SS instances in park mode
mmc: sdhci: do not enable card detect interrupt for gpio cd type
ALSA: usb-audio: Rewrite registration quirk handling
ACPI: video: Use native backlight on Acer Aspire 5783z
ALSA: usb-audio: Add registration quirk for Kingston HyperX Cloud Alpha S
Input: mms114 - add extra compatible for mms345l
ACPI: video: Use native backlight on Acer TravelMate 5735Z
ALSA: usb-audio: Add registration quirk for Kingston HyperX Cloud Flight S
iio:health:afe4404 Fix timestamp alignment and prevent data leak.
phy: sun4i-usb: fix dereference of pointer phy0 before it is null checked
arm64: dts: meson: add missing gxl rng clock
spi: spi-sun6i: sun6i_spi_transfer_one(): fix setting of clock rate
usb: gadget: udc: atmel: fix uninitialized read in debug printk
staging: comedi: verify array index is correct before using it
Revert "thermal: mediatek: fix register index error"
ARM: dts: socfpga: Align L2 cache-controller nodename with dtschema
regmap: debugfs: Don't sleep while atomic for fast_io regmaps
copy_xstate_to_kernel: Fix typo which caused GDB regression
apparmor: ensure that dfa state tables have entries
perf stat: Zero all the 'ena' and 'run' array slot stats for interval mode
soc: qcom: rpmh: Update dirty flag only when data changes
soc: qcom: rpmh: Invalidate SLEEP and WAKE TCSes before flushing new data
soc: qcom: rpmh-rsc: Clear active mode configuration for wake TCS
soc: qcom: rpmh-rsc: Allow using free WAKE TCS for active request
mtd: rawnand: marvell: Use nand_cleanup() when the device is not yet registered
mtd: rawnand: marvell: Fix probe error path
mtd: rawnand: timings: Fix default tR_max and tCCS_min timings
mtd: rawnand: brcmnand: fix CS0 layout
mtd: rawnand: oxnas: Keep track of registered devices
mtd: rawnand: oxnas: Unregister all devices on error
mtd: rawnand: oxnas: Release all devices in the _remove() path
slimbus: core: Fix mismatch in of_node_get/put
HID: magicmouse: do not set up autorepeat
HID: quirks: Always poll Obins Anne Pro 2 keyboard
HID: quirks: Ignore Simply Automated UPB PIM
ALSA: line6: Perform sanity check for each URB creation
ALSA: line6: Sync the pending work cancel at disconnection
ALSA: usb-audio: Fix race against the error recovery URB submission
ALSA: hda/realtek - change to suitable link model for ASUS platform
ALSA: hda/realtek - Enable Speaker for ASUS UX533 and UX534
USB: c67x00: fix use after free in c67x00_giveback_urb
usb: dwc2: Fix shutdown callback in platform
usb: chipidea: core: add wakeup support for extcon
usb: gadget: function: fix missing spinlock in f_uac1_legacy
USB: serial: iuu_phoenix: fix memory corruption
USB: serial: cypress_m8: enable Simply Automated UPB PIM
USB: serial: ch341: add new Product ID for CH340
USB: serial: option: add GosunCn GM500 series
USB: serial: option: add Quectel EG95 LTE modem
virt: vbox: Fix VBGL_IOCTL_VMMDEV_REQUEST_BIG and _LOG req numbers to match upstream
virt: vbox: Fix guest capabilities mask check
virtio: virtio_console: add missing MODULE_DEVICE_TABLE() for rproc serial
serial: mxs-auart: add missed iounmap() in probe failure and remove
ovl: inode reference leak in ovl_is_inuse true case.
ovl: relax WARN_ON() when decoding lower directory file handle
ovl: fix unneeded call to ovl_change_flags()
fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS
Revert "zram: convert remaining CLASS_ATTR() to CLASS_ATTR_RO()"
mei: bus: don't clean driver pointer
Input: i8042 - add Lenovo XiaoXin Air 12 to i8042 nomux list
uio_pdrv_genirq: fix use without device tree and no interrupt
timer: Prevent base->clk from moving backward
timer: Fix wheel index calculation on last level
MIPS: Fix build for LTS kernel caused by backporting lpj adjustment
riscv: use 16KB kernel stack on 64-bit
hwmon: (emc2103) fix unable to change fan pwm1_enable attribute
powerpc/book3s64/pkeys: Fix pkey_access_permitted() for execute disable pkey
intel_th: pci: Add Jasper Lake CPU support
intel_th: pci: Add Tiger Lake PCH-H support
intel_th: pci: Add Emmitsburg PCH support
intel_th: Fix a NULL dereference when hub driver is not loaded
dmaengine: fsl-edma: Fix NULL pointer exception in fsl_edma_tx_handler
misc: atmel-ssc: lock with mutex instead of spinlock
thermal/drivers/cpufreq_cooling: Fix wrong frequency converted from power
arm64: ptrace: Override SPSR.SS when single-stepping is enabled
arm64: ptrace: Consistently use pseudo-singlestep exceptions
arm64: compat: Ensure upper 32 bits of x0 are zero on syscall return
sched: Fix unreliable rseq cpu_id for new tasks
sched/fair: handle case of task_h_load() returning 0
genirq/affinity: Handle affinity setting on inactive interrupts correctly
printk: queue wake_up_klogd irq_work only if per-CPU areas are ready
libceph: don't omit recovery_deletes in target_copy()
rxrpc: Fix trace string
spi: sprd: switch the sequence of setting WDG_LOAD_LOW and _HIGH
Linux 4.19.134
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ieeb9e03f4a2d51aeebe3a3eadd9c1b93a26088a0
|
||
|
|
6d584c6e29 |
cgroup: Fix sock_cgroup_data on big-endian.
[ Upstream commit 14b032b8f8fce03a546dcf365454bec8c4a58d7d ]
In order for no_refcnt and is_data to be the lowest order two
bits in the 'val' we have to pad out the bitfield of the u8.
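The effect of the padding can be illustrated with a simplified user-space mirror of the layout. This is a sketch under assumptions, not the kernel definition: `union skcd_sketch` and `skcd_flags_as_val()` are hypothetical names, the `prioidx`, `classid` and `padding` members are illustrative fillers, and the code relies on GCC/Clang behaviour (`__BYTE_ORDER__`, packed structs, `uint8_t` bitfields, and bitfield allocation order following byte order).

```c
#include <assert.h>
#include <stdint.h>

/* Simplified mirror of a flags-plus-payload word overlaid on a 64-bit val. */
union skcd_sketch {
	struct {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
		uint8_t  is_data : 1;
		uint8_t  no_refcnt : 1;
		uint8_t  unused : 6;	/* pads the flag byte to a full 8 bits */
		uint8_t  padding;
		uint16_t prioidx;
		uint32_t classid;
#else
		uint32_t classid;
		uint16_t prioidx;
		uint8_t  padding;
		uint8_t  unused : 6;	/* keeps the flags in the two lowest bits */
		uint8_t  no_refcnt : 1;
		uint8_t  is_data : 1;
#endif
	} __attribute__((packed));
	uint64_t val;
};

/* Returns val after setting only the two flag bits. */
uint64_t skcd_flags_as_val(unsigned int is_data, unsigned int no_refcnt)
{
	union skcd_sketch d = { .val = 0 };

	assert(sizeof(d) == sizeof(uint64_t));
	d.is_data = is_data & 1;
	d.no_refcnt = no_refcnt & 1;
	return d.val;
}
```

With the `unused : 6` member in place, the two flags occupy bits 0 and 1 of `val` in both branches. Without it, a compiler that fills bitfields from the most significant bit on big-endian targets would place them in the top bits of that byte instead.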
Fixes: ad0f75e5f57c ("cgroup: fix cgroup_sk_alloc() for sk_clone_lock()")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
0505cc4c90 |
cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
[ Upstream commit ad0f75e5f57ccbceec13274e1e242f2b5a6397ed ]
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.
sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.
The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.
This bug seems to be introduced since the beginning, commit
|
||
|
|
487e61785a |
Merge 4.19.66 into android-4.19
Changes in 4.19.66
scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure
gcc-9: don't warn about uninitialized variable
driver core: Establish order of operations for device_add and device_del via bitflag
drivers/base: Introduce kill_device()
libnvdimm/bus: Prevent duplicate device_unregister() calls
libnvdimm/region: Register badblocks before namespaces
libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant
libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
HID: wacom: fix bit shift for Cintiq Companion 2
HID: Add quirk for HP X1200 PIXART OEM mouse
IB: directly cast the sockaddr union to aockaddr
atm: iphase: Fix Spectre v1 vulnerability
bnx2x: Disable multi-cos feature.
ife: error out when nla attributes are empty
ip6_gre: reload ipv6h in prepare_ip6gre_xmit_ipv6
ip6_tunnel: fix possible use-after-free on xmit
ipip: validate header length in ipip_tunnel_xmit
mlxsw: spectrum: Fix error path in mlxsw_sp_module_init()
mvpp2: fix panic on module removal
mvpp2: refactor MTU change code
net: bridge: delete local fdb on device init failure
net: bridge: mcast: don't delete permanent entries when fast leave is enabled
net: fix ifindex collision during namespace removal
net/mlx5e: always initialize frag->last_in_page
net/mlx5: Use reversed order when unregister devices
net: phylink: Fix flow control for fixed-link
net: qualcomm: rmnet: Fix incorrect UL checksum offload logic
net: sched: Fix a possible null-pointer dereference in dequeue_func()
net sched: update vlan action for batched events operations
net: sched: use temporary variable for actions indexes
net/smc: do not schedule tx_work in SMC_CLOSED state
NFC: nfcmrvl: fix gpio-handling regression
ocelot: Cancel delayed work before wq destruction
tipc: compat: allow tipc commands without arguments
tun: mark small packets as owned by the tap sock
net/mlx5: Fix modify_cq_in alignment
net/mlx5e: Prevent encap flow counter update async to user query
r8169: don't use MSI before RTL8168d
compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
cgroup: Call cgroup_release() before __exit_signal()
cgroup: Implement css_task_iter_skip()
cgroup: Include dying leaders with live threads in PROCS iterations
cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
cgroup: Fix css_task_iter_advance_css_set() cset skip condition
spi: bcm2835: Fix 3-wire mode if DMA is enabled
Linux 4.19.66
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id33ce169af8bf14a3791040b4cf923832ce84f6c
|
||
|
|
4340d175b8 |
cgroup: Include dying leaders with live threads in PROCS iterations
commit c03cd7738a83b13739f00546166969342c8ff014 upstream.
CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
this means that a process with dying leader and live threads will be
skipped. IOW, cgroup.procs might be empty while cgroup.threads isn't,
which is confusing to say the least.
Fix it by making cset track dying tasks and include dying leaders with
live threads in PROCS iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
cab4399ebf |
Merge 4.19.47 into android-4.19
Changes in 4.19.47
x86: Hide the int3_emulate_call/jmp functions from UML
ext4: do not delete unlinked inode from orphan list on failed truncate
ext4: wait for outstanding dio during truncate in nojournal mode
f2fs: Fix use of number of devices
KVM: x86: fix return value for reserved EFER
bio: fix improper use of smp_mb__before_atomic()
sbitmap: fix improper use of smp_mb__before_atomic()
Revert "scsi: sd: Keep disk read-only when re-reading partition"
crypto: vmx - CTR: always increment IV as quadword
mmc: sdhci-iproc: cygnus: Set NO_HISPD bit to fix HS50 data hold time problem
mmc: sdhci-iproc: Set NO_HISPD bit to fix HS50 data hold time problem
kvm: svm/avic: fix off-by-one in checking host APIC ID
libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
arm64/kernel: kaslr: reduce module randomization range to 2 GB
arm64/iommu: handle non-remapped addresses in ->mmap and ->get_sgtable
gfs2: Fix sign extension bug in gfs2_update_stats
btrfs: don't double unlock on error in btrfs_punch_hole
Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path
Btrfs: avoid fallback to transaction commit during fsync of files with holes
Btrfs: fix race between ranged fsync and writeback of adjacent ranges
btrfs: sysfs: Fix error path kobject memory leak
btrfs: sysfs: don't leak memory when failing add fsid
udlfb: fix some inconsistent NULL checking
fbdev: fix divide error in fb_var_to_videomode
NFSv4.2 fix unnecessary retry in nfs4_copy_file_range
NFSv4.1 fix incorrect return value in copy_file_range
bpf: add bpf_jit_limit knob to restrict unpriv allocations
brcmfmac: assure SSID length from firmware is limited
brcmfmac: add subtype check for event handling in data path
arm64: errata: Add workaround for Cortex-A76 erratum #1463225
btrfs: honor path->skip_locking in backref code
ovl: relax WARN_ON() for overlapping layers use case
fbdev: fix WARNING in __alloc_pages_nodemask bug
media: cpia2: Fix use-after-free in cpia2_exit
media: serial_ir: Fix use-after-free in serial_ir_init_module
media: vb2: add waiting_in_dqbuf flag
media: vivid: use vfree() instead of kfree() for dev->bitmap_cap
ssb: Fix possible NULL pointer dereference in ssb_host_pcmcia_exit
bpf: devmap: fix use-after-free Read in __dev_map_entry_free
batman-adv: mcast: fix multicast tt/tvlv worker locking
at76c50x-usb: Don't register led_trigger if usb_register_driver failed
acct_on(): don't mess with freeze protection
Revert "btrfs: Honour FITRIM range constraints during free space trim"
gfs2: Fix lru_count going negative
cxgb4: Fix error path in cxgb4_init_module
NFS: make nfs_match_client killable
IB/hfi1: Fix WQ_MEM_RECLAIM warning
gfs2: Fix occasional glock use-after-free
mmc: core: Verify SD bus width
tools/bpf: fix perf build error with uClibc (seen on ARC)
selftests/bpf: set RLIMIT_MEMLOCK properly for test_libbpf_open.c
bpftool: exclude bash-completion/bpftool from .gitignore pattern
dmaengine: tegra210-dma: free dma controller in remove()
net: ena: gcc 8: fix compilation warning
hv_netvsc: fix race that may miss tx queue wakeup
Bluetooth: Ignore CC events not matching the last HCI command
pinctrl: zte: fix leaked of_node references
ASoC: Intel: kbl_da7219_max98357a: Map BTN_0 to KEY_PLAYPAUSE
usb: dwc2: gadget: Increase descriptors count for ISOC's
usb: dwc3: move synchronize_irq() out of the spinlock protected block
ASoC: hdmi-codec: unlock the device on startup errors
powerpc/perf: Return accordingly on invalid chip-id in
powerpc/boot: Fix missing check of lseek() return value
powerpc/perf: Fix loop exit condition in nest_imc_event_init
ASoC: imx: fix fiq dependencies
spi: pxa2xx: fix SCR (divisor) calculation
brcm80211: potential NULL dereference in brcmf_cfg80211_vndr_cmds_dcmd_handler()
ACPI / property: fix handling of data_nodes in acpi_get_next_subnode()
drm/nouveau/bar/nv50: ensure BAR is mapped
media: stm32-dcmi: return appropriate error codes during probe
ARM: vdso: Remove dependency with the arch_timer driver internals
arm64: Fix compiler warning from pte_unmap() with -Wunused-but-set-variable
powerpc/watchdog: Use hrtimers for per-CPU heartbeat
sched/cpufreq: Fix kobject memleak
scsi: qla2xxx: Fix a qla24xx_enable_msix() error path
scsi: qla2xxx: Fix abort handling in tcm_qla2xxx_write_pending()
scsi: qla2xxx: Avoid that lockdep complains about unsafe locking in tcm_qla2xxx_close_session()
scsi: qla2xxx: Fix hardirq-unsafe locking
x86/modules: Avoid breaking W^X while loading modules
Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota reserve
btrfs: fix panic during relocation after ENOSPC before writeback happens
btrfs: Don't panic when we can't find a root key
iwlwifi: pcie: don't crash on invalid RX interrupt
rtc: 88pm860x: prevent use-after-free on device remove
rtc: stm32: manage the get_irq probe defer case
scsi: qedi: Abort ep termination if offload not scheduled
s390/kexec_file: Fix detection of text segment in ELF loader
sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs
w1: fix the resume command API
s390: qeth: address type mismatch warning
dmaengine: pl330: _stop: clear interrupt status
mac80211/cfg80211: update bss channel on channel switch
libbpf: fix samples/bpf build failure due to undefined UINT32_MAX
slimbus: fix a potential NULL pointer dereference in of_qcom_slim_ngd_register
ASoC: fsl_sai: Update is_slave_mode with correct value
mwifiex: prevent an array overflow
rsi: Fix NULL pointer dereference in kmalloc
net: cw1200: fix a NULL pointer dereference
nvme: set 0 capacity if namespace block size exceeds PAGE_SIZE
nvme-rdma: fix a NULL deref when an admin connect times out
crypto: sun4i-ss - Fix invalid calculation of hash end
bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set
bcache: return error immediately in bch_journal_replay()
bcache: fix failure in journal relplay
bcache: add failure check to run_cache_set() for journal replay
bcache: avoid clang -Wunintialized warning
RDMA/cma: Consider scope_id while binding to ipv6 ll address
vfio-ccw: Do not call flush_workqueue while holding the spinlock
vfio-ccw: Release any channel program when releasing/removing vfio-ccw mdev
x86/build: Move _etext to actual end of .text
smpboot: Place the __percpu annotation correctly
x86/mm: Remove in_nmi() warning from 64-bit implementation of vmalloc_fault()
mm/uaccess: Use 'unsigned long' to placate UBSAN warnings on older GCC versions
Bluetooth: hci_qca: Give enough time to ROME controller to bootup.
HID: logitech-hidpp: use RAP instead of FAP to get the protocol version
pinctrl: pistachio: fix leaked of_node references
pinctrl: samsung: fix leaked of_node references
clk: rockchip: undo several noc and special clocks as critical on rk3288
perf/arm-cci: Remove broken race mitigation
dmaengine: at_xdmac: remove BUG_ON macro in tasklet
media: coda: clear error return value before picture run
media: ov6650: Move v4l2_clk_get() to ov6650_video_probe() helper
media: au0828: stop video streaming only when last user stops
media: ov2659: make S_FMT succeed even if requested format doesn't match
audit: fix a memory leak bug
media: stm32-dcmi: fix crash when subdev do not expose any formats
media: au0828: Fix NULL pointer dereference in au0828_analog_stream_enable()
media: pvrusb2: Prevent a buffer overflow
iio: adc: stm32-dfsdm: fix unmet direct dependencies detected
block: fix use-after-free on gendisk
powerpc/numa: improve control of topology updates
powerpc/64: Fix booting large kernels with STRICT_KERNEL_RWX
random: fix CRNG initialization when random.trust_cpu=1
random: add a spinlock_t to struct batched_entropy
cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
sched/core: Check quota and period overflow at usec to nsec conversion
sched/rt: Check integer overflow at usec to nsec conversion
sched/core: Handle overflow in cpu_shares_write_u64
staging: vc04_services: handle kzalloc failure
drm/msm: a5xx: fix possible object reference leak
irq_work: Do not raise an IPI when queueing work on the local CPU
thunderbolt: Take domain lock in switch sysfs attribute callbacks
s390/qeth: handle error from qeth_update_from_chp_desc()
USB: core: Don't unbind interfaces following device reset failure
x86/irq/64: Limit IST stack overflow check to #DB stack
drm: etnaviv: avoid DMA API warning when importing buffers
phy: sun4i-usb: Make sure to disable PHY0 passby for peripheral mode
phy: mapphone-mdm6600: add gpiolib dependency
i40e: Able to add up to 16 MAC filters on an untrusted VF
i40e: don't allow changes to HW VLAN stripping on active port VLANs
ACPI/IORT: Reject platform device creation on NUMA node mapping failure
arm64: vdso: Fix clock_getres() for CLOCK_REALTIME
RDMA/cxgb4: Fix null pointer dereference on alloc_skb failure
perf/x86/msr: Add Icelake support
perf/x86/intel/rapl: Add Icelake support
perf/x86/intel/cstate: Add Icelake support
hwmon: (vt1211) Use request_muxed_region for Super-IO accesses
hwmon: (smsc47m1) Use request_muxed_region for Super-IO accesses
hwmon: (smsc47b397) Use request_muxed_region for Super-IO accesses
hwmon: (pc87427) Use request_muxed_region for Super-IO accesses
hwmon: (f71805f) Use request_muxed_region for Super-IO accesses
scsi: libsas: Do discovery on empty PHY to update PHY info
mmc: core: make pwrseq_emmc (partially) support sleepy GPIO controllers
mmc_spi: add a status check for spi_sync_locked
mmc: sdhci-of-esdhc: add erratum eSDHC5 support
mmc: sdhci-of-esdhc: add erratum A-009204 support
mmc: sdhci-of-esdhc: add erratum eSDHC-A001 and A-008358 support
drm/amdgpu: fix old fence check in amdgpu_fence_emit
PM / core: Propagate dev->power.wakeup_path when no callbacks
clk: rockchip: Fix video codec clocks on rk3288
extcon: arizona: Disable mic detect if running when driver is removed
clk: rockchip: Make rkpwm a critical clock on rk3288
s390: zcrypt: initialize variables before_use
x86/microcode: Fix the ancient deprecated microcode loading method
s390/mm: silence compiler warning when compiling without CONFIG_PGSTE
s390: cio: fix cio_irb declaration
selftests: cgroup: fix cleanup path in test_memcg_subtree_control()
qmi_wwan: Add quirk for Quectel dynamic config
cpufreq: ppc_cbe: fix possible object reference leak
cpufreq/pasemi: fix possible object reference leak
cpufreq: pmac32: fix possible object reference leak
cpufreq: kirkwood: fix possible object reference leak
block: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBR
x86/build: Keep local relocations with ld.lld
drm/pl111: fix possible object reference leak
iio: ad_sigma_delta: Properly handle SPI bus locking vs CS assertion
iio: hmc5843: fix potential NULL pointer dereferences
iio: common: ssp_sensors: Initialize calculated_time in ssp_common_process_data
iio: adc: ti-ads7950: Fix improper use of mlock
selftests/bpf: ksym_search won't check symbols exists
rtlwifi: fix a potential NULL pointer dereference
mwifiex: Fix mem leak in mwifiex_tm_cmd
brcmfmac: fix missing checks for kmemdup
b43: shut up clang -Wuninitialized variable warning
brcmfmac: convert dev_init_lock mutex to completion
brcmfmac: fix WARNING during USB disconnect in case of unempty psq
brcmfmac: fix race during disconnect when USB completion is in progress
brcmfmac: fix Oops when bringing up interface during USB disconnect
rtc: xgene: fix possible race condition
rtlwifi: fix potential NULL pointer dereference
scsi: ufs: Fix regulator load and icc-level configuration
scsi: ufs: Avoid configuring regulator with undefined voltage range
drm/panel: otm8009a: Add delay at the end of initialization
arm64: cpu_ops: fix a leaked reference by adding missing of_node_put
wil6210: fix return code of wmi_mgmt_tx and wmi_mgmt_tx_ext
x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP
x86/uaccess, signal: Fix AC=1 bloat
x86/ia32: Fix ia32_restore_sigcontext() AC leak
x86/uaccess: Fix up the fixup
chardev: add additional check for minor range overlap
RDMA/hns: Fix bad endianess of port_pd variable
sh: sh7786: Add explicit I/O cast to sh7786_mm_sel()
HID: core: move Usage Page concatenation to Main item
ASoC: eukrea-tlv320: fix a leaked reference by adding missing of_node_put
ASoC: fsl_utils: fix a leaked reference by adding missing of_node_put
cxgb3/l2t: Fix undefined behaviour
HID: logitech-hidpp: change low battery level threshold from 31 to 30 percent
spi: tegra114: reset controller on probe
kobject: Don't trigger kobject_uevent(KOBJ_REMOVE) twice.
media: video-mux: fix null pointer dereferences
media: wl128x: prevent two potential buffer overflows
media: gspca: Kill URBs on USB device disconnect
efifb: Omit memory map check on legacy boot
thunderbolt: property: Fix a missing check of kzalloc
thunderbolt: Fix to check the return value of kmemdup
timekeeping: Force upper bound for setting CLOCK_REALTIME
scsi: qedf: Add missing return in qedf_post_io_req() in the fcport offload check
virtio_console: initialize vtermno value for ports
tty: ipwireless: fix missing checks for ioremap
overflow: Fix -Wtype-limits compilation warnings
x86/mce: Fix machine_check_poll() tests for error types
rcutorture: Fix cleanup path for invalid torture_type strings
x86/mce: Handle varying MCA bank counts
rcuperf: Fix cleanup path for invalid perf_type strings
usb: core: Add PM runtime calls to usb_hcd_platform_shutdown
scsi: qla4xxx: avoid freeing unallocated dma memory
scsi: lpfc: avoid uninitialized variable warning
selinux: avoid uninitialized variable warning
batman-adv: allow updating DAT entry timeouts on incoming ARP Replies
dmaengine: tegra210-adma: use devm_clk_*() helpers
hwrng: omap - Set default quality
thunderbolt: Fix to check return value of ida_simple_get
thunderbolt: Fix to check for kmemdup failure
drm/amd/display: fix releasing planes when exiting odm
thunderbolt: property: Fix a NULL pointer dereference
e1000e: Disable runtime PM on CNP+
tinydrm/mipi-dbi: Use dma-safe buffers for all SPI transfers
igb: Exclude device from suspend direct complete optimization
media: si2165: fix a missing check of return value
media: dvbsky: Avoid leaking dvb frontend
media: m88ds3103: serialize reset messages in m88ds3103_set_frontend
media: staging: davinci_vpfe: disallow building with COMPILE_TEST
drm/amd/display: Fix Divide by 0 in memory calculations
drm/amd/display: Set stream->mode_changed when connectors change
scsi: ufs: fix a missing check of devm_reset_control_get
media: vimc: stream: fix thread state before sleep
media: gspca: do not resubmit URBs when streaming has stopped
media: go7007: avoid clang frame overflow warning with KASAN
media: vimc: zero the media_device on probe
scsi: lpfc: Fix FDMI manufacturer attribute value
scsi: lpfc: Fix fc4type information for FDMI
media: saa7146: avoid high stack usage with clang
scsi: lpfc: Fix SLI3 commands being issued on SLI4 devices
spi : spi-topcliff-pch: Fix to handle empty DMA buffers
drm/omap: dsi: Fix PM for display blank with paired dss_pll calls
spi: rspi: Fix sequencer reset during initialization
spi: imx: stop buffer overflow in RX FIFO flush
spi: Fix zero length xfer bug
ASoC: davinci-mcasp: Fix clang warning without CONFIG_PM
drm/v3d: Handle errors from IRQ setup.
drm/drv: Hold ref on parent device during drm_device lifetime
drm: Wake up next in drm_read() chain if we are forced to putback the event
drm/sun4i: dsi: Change the start delay calculation
vfio-ccw: Prevent quiesce function going into an infinite loop
drm/sun4i: dsi: Enforce boundaries on the start delay
NFS: Fix a double unlock from nfs_match,get_client
Linux 4.19.47
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
|
||
|
|
4e4d5cea79 |
cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
[ Upstream commit 4dcabece4c3a9f9522127be12cc12cc120399b2f ] The number of descendant cgroups and the number of dying descendant cgroups are currently synchronized using the cgroup_mutex. The number of descendant cgroups will be required by the cgroup v2 freezer, which will use it to determine if a cgroup is frozen (depending on total number of descendants and number of frozen descendants). It's not always acceptable to grab the cgroup_mutex, especially from quite hot paths (e.g. exit()). To avoid this, let's additionally synchronize these counters using the css_set_lock. So, it's safe to read these counters with either cgroup_mutex or css_set_lock locked, and for changing both locks should be acquired. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com Signed-off-by: Sasha Levin <sashal@kernel.org> |
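The locking rule described above (read under either lock, write under both) can be illustrated with a small userspace model. This is only a sketch, assuming Python threading locks as stand-ins for cgroup_mutex and css_set_lock; the class and method names are hypothetical and do not correspond to kernel functions.

```python
import threading


class DescendantCounters:
    """Toy model of counters guarded by two locks (names are illustrative)."""

    def __init__(self):
        self.cgroup_mutex = threading.Lock()   # stand-in for the sleeping mutex
        self.css_set_lock = threading.Lock()   # stand-in for the spinlock
        self.nr_descendants = 0
        self.nr_dying_descendants = 0

    def add_descendant(self):
        # Writers hold BOTH locks, always acquired in the same order.
        with self.cgroup_mutex, self.css_set_lock:
            self.nr_descendants += 1

    def mark_dying(self):
        # A descendant moves from the live count to the dying count.
        with self.cgroup_mutex, self.css_set_lock:
            self.nr_descendants -= 1
            self.nr_dying_descendants += 1

    def read_hot_path(self):
        # Hot paths hold only the cheap lock; no writer can run concurrently
        # because every writer also needs this lock.
        with self.css_set_lock:
            return self.nr_descendants, self.nr_dying_descendants

    def read_slow_path(self):
        # Holding only the mutex is equally safe for reading.
        with self.cgroup_mutex:
            return self.nr_descendants, self.nr_dying_descendants
```

Because every writer holds both locks, a reader holding either one is guaranteed that no update can happen while it looks at the counters.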
||
|
|
d885da678e |
Merge 4.19.34 into android-4.19
Changes in 4.19.34 arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals ext4: cleanup bh release code in ext4_ind_remove_space() tty/serial: atmel: Add is_half_duplex helper tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped CIFS: fix POSIX lock leak and invalid ptr deref h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux- f2fs: fix to adapt small inline xattr space in __find_inline_xattr() f2fs: fix to avoid deadlock in f2fs_read_inline_dir() tracing: kdb: Fix ftdump to not sleep net/mlx5: Avoid panic when setting vport rate net/mlx5: Avoid panic when setting vport mac, getting vport config gpio: gpio-omap: fix level interrupt idling include/linux/relay.h: fix percpu annotation in struct rchan sysctl: handle overflow for file-max net: stmmac: Avoid sometimes uninitialized Clang warnings enic: fix build warning without CONFIG_CPUMASK_OFFSTACK libbpf: force fixdep compilation at the start of the build scsi: hisi_sas: Set PHY linkrate when disconnected scsi: hisi_sas: Fix a timeout race of driver internal and SMP IO iio: adc: fix warning in Qualcomm PM8xxx HK/XOADC driver x86/hyperv: Fix kernel panic when kexec on HyperV perf c2c: Fix c2c report for empty numa node mm/sparse: fix a bad comparison mm/cma.c: cma_declare_contiguous: correct err handling mm/page_ext.c: fix an imbalance with kmemleak mm, swap: bounds check swap_info array accesses to avoid NULL derefs mm,oom: don't kill global init via memory.oom.group memcg: killed threads should not invoke memcg OOM killer mm, mempolicy: fix uninit memory access mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512! 
mm/slab.c: kmemleak no scan alien caches ocfs2: fix a panic problem caused by o2cb_ctl f2fs: do not use mutex lock in atomic context fs/file.c: initialize init_files.resize_wait page_poison: play nicely with KASAN cifs: use correct format characters dm thin: add sanity checks to thin-pool and external snapshot creation f2fs: fix to check inline_xattr_size boundary correctly cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED cifs: Fix NULL pointer dereference of devname netfilter: nf_tables: check the result of dereferencing base_chain->stats netfilter: conntrack: tcp: only close if RST matches exact sequence jbd2: fix invalid descriptor block checksum fs: fix guard_bio_eod to check for real EOD errors tools lib traceevent: Fix buffer overflow in arg_eval PCI/PME: Fix hotplug/sysfs remove deadlock in pcie_pme_remove() wil6210: check null pointer in _wil_cfg80211_merge_extra_ies mt76: fix a leaked reference by adding a missing of_node_put crypto: crypto4xx - add missing of_node_put after of_device_is_available crypto: cavium/zip - fix collision with generic cra_driver_name usb: chipidea: Grab the (legacy) USB PHY by phandle first powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc coresight: etm4x: Add support to enable ETMv4.2 serial: 8250_pxa: honor the port number from devicetree ARM: 8840/1: use a raw_spinlock_t in unwind iommu/io-pgtable-arm-v7s: Only kmemleak_ignore L2 tables powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback btrfs: qgroup: Make qgroup async transaction commit more aggressive mmc: omap: fix the maximum timeout setting net: dsa: mv88e6xxx: Add lockdep classes to fix false positive splat e1000e: Fix -Wformat-truncation warnings mlxsw: spectrum: Avoid -Wformat-truncation warnings 
platform/x86: ideapad-laptop: Fix no_hw_rfkill_list for Lenovo RESCUER R720-15IKBN platform/mellanox: mlxreg-hotplug: Fix KASAN warning loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part() IB/mlx4: Increase the timeout for CM cache clk: fractional-divider: check parent rate only if flag is set perf annotate: Fix getting source line failure ASoC: qcom: Fix of-node refcount unbalance in qcom_snd_parse_of() cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies efi: cper: Fix possible out-of-bounds access s390/ism: ignore some errors during deregistration scsi: megaraid_sas: return error when create DMA pool failed scsi: fcoe: make use of fip_mode enum complete drm/amd/display: Clear stream->mode_changed after commit perf test: Fix failure of 'evsel-tp-sched' test on s390 mwifiex: don't advertise IBSS features without FW support perf report: Don't shadow inlined symbol with different addr range SoC: imx-sgtl5000: add missing put_device() media: ov7740: fix runtime pm initialization media: sh_veu: Correct return type for mem2mem buffer helpers media: s5p-jpeg: Correct return type for mem2mem buffer helpers media: rockchip/rga: Correct return type for mem2mem buffer helpers media: s5p-g2d: Correct return type for mem2mem buffer helpers media: mx2_emmaprp: Correct return type for mem2mem buffer helpers media: mtk-jpeg: Correct return type for mem2mem buffer helpers mt76: usb: do not run mt76u_queues_deinit twice xen/gntdev: Do not destroy context while dma-bufs are in use vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1 HID: intel-ish-hid: avoid binding wrong ishtp_cl_device cgroup, rstat: Don't flush subtree root unless necessary jbd2: fix race when writing superblock leds: lp55xx: fix null deref on firmware load failure perf report: Add s390 diagnosic sampling descriptor size iwlwifi: pcie: fix emergency path ACPI / video: Refactor and fix dmi_is_desktop() selftests: skip seccomp get_metadata test if not real root kprobes: 
Prohibit probing on bsearch() kprobes: Prohibit probing on RCU debug routine netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm ARM: 8833/1: Ensure that NEON code always compiles with Clang ARM: dts: meson8b: fix the Ethernet data line signals in eth_rgmii_pins ALSA: PCM: check if ops are defined before suspending PCM ath10k: fix shadow register implementation for WCN3990 usb: f_fs: Avoid crash due to out-of-scope stack ptr access sched/topology: Fix percpu data types in struct sd_data & struct s_data bcache: fix input overflow to cache set sysfs file io_error_halflife bcache: fix input overflow to sequential_cutoff bcache: fix potential div-zero error of writeback_rate_i_term_inverse bcache: improve sysfs_strtoul_clamp() genirq: Avoid summation loops for /proc/stat net: marvell: mvpp2: fix stuck in-band SGMII negotiation iw_cxgb4: fix srqidx leak during connection abort net: phy: consider latched link-down status in polling mode fbdev: fbmem: fix memory access if logo is bigger than the screen cdrom: Fix race condition in cdrom_sysctl_register drm: rcar-du: add missing of_node_put drm/amd/display: Don't re-program planes for DPMS changes drm/amd/display: Disconnect mpcc when changing tg perf/aux: Make perf_event accessible to setup_aux() e1000e: fix cyclic resets at link up with active tx e1000e: Exclude device from suspend direct complete optimization platform/x86: intel_pmc_core: Fix PCH IP sts reading i2c: of: Try to find an I2C adapter matching the parent staging: spi: mt7621: Add return code check on device_reset() iwlwifi: mvm: fix RFH config command with >=10 CPUs ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK efi/memattr: Don't bail on zero VA if it equals the region's PA sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() drm/vkms: Bugfix extra vblank frame ARM: dts: lpc32xx: Remove leading 0x and 0s from 
bindings notation efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted soc: qcom: gsbi: Fix error handling in gsbi_probe() mt7601u: bump supported EEPROM version ARM: 8830/1: NOMMU: Toggle only bits in EXC_RETURN we are really care of ARM: avoid Cortex-A9 livelock on tight dmb loops block, bfq: fix in-service-queue check for queue merging bpf: fix missing prototype warnings selftests/bpf: skip verifier tests for unsupported program types powerpc/64s: Clear on-stack exception marker upon exception return cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting backlight: pwm_bl: Use gpiod_get_value_cansleep() to get initial state tty: increase the default flip buffer limit to 2*640K powerpc/pseries: Perform full re-add of CPU for topology update post-migration drm/amd/display: Enable vblank interrupt during CRC capture ALSA: dice: add support for Solid State Logic Duende Classic/Mini usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded platform/x86: intel-hid: Missing power button release on some Dell models perf script python: Use PyBytes for attr in trace-event-python perf script python: Add trace_context extension module to sys.modules media: mt9m111: set initial frame size other than 0x0 hwrng: virtio - Avoid repeated init of completion soc/tegra: fuse: Fix illegal free of IO base address HID: intel-ish: ipc: handle PIMR before ish_wakeup also clear PISR busy_clear bit f2fs: UBSAN: set boolean value iostat_enable correctly hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable cpu/hotplug: Mute hotplug lockdep during init dmaengine: imx-dma: fix warning comparison of distinct pointer types dmaengine: qcom_hidma: assign channel cookie correctly dmaengine: qcom_hidma: initialize tx flags in hidma_prep_dma_* netfilter: physdev: relax br_netfilter dependency media: rcar-vin: Allow independent VIN link enablement media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration regulator: 
act8865: Fix act8600_sudcdc_voltage_ranges setting pinctrl: meson: meson8b: add the eth_rxd2 and eth_rxd3 pins drm: Auto-set allow_fb_modifiers when given modifiers at plane init drm/nouveau: Stop using drm_crtc_force_disable x86/build: Specify elf_i386 linker emulation explicitly for i386 objects selinux: do not override context on context mounts brcmfmac: Use firmware_request_nowarn for the clm_blob wlcore: Fix memory leak in case wl12xx_fetch_firmware failure x86/build: Mark per-CPU symbols as absolute explicitly for LLD drm/fb-helper: fix leaks in error path of drm_fb_helper_fbdev_setup clk: meson: clean-up clock registration clk: rockchip: fix frac settings of GPLL clock for rk3328 dmaengine: tegra: avoid overflow of byte tracking Input: soc_button_array - fix mapping of the 5th GPIO in a PNP0C40 device drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers net: stmmac: Avoid one more sometimes uninitialized Clang warning ACPI / video: Extend chassis-type detection with a "Lunch Box" check bcache: fix potential div-zero error of writeback_rate_p_term_inverse kprobes/x86: Blacklist non-attachable interrupt functions Linux 4.19.34 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
|
d0bc74c563 |
cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
[ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ]
The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
needs pids_free() to uncharge the pid.
However, ->free() is called from __put_task_struct()->cgroup_free() and this
is too late. Even the trivial program which does
    for (;;) {
        int pid = fork();
        assert(pid >= 0);
        if (pid)
            wait(NULL);
        else
            exit(0);
    }
can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
implies an RCU gp after the task/pid goes away and before the final put().
Test-case:
mkdir -p /tmp/CG
mount -t cgroup2 none /tmp/CG
echo '+pids' > /tmp/CG/cgroup.subtree_control
mkdir /tmp/CG/PID
echo 2 > /tmp/CG/PID/pids.max
perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
echo $! > /tmp/CG/PID/cgroup.procs
Without this patch the forking process fails soon after migration.
Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
into the new helper, cgroup_release(), called by release_task() which actually
frees the pid(s).
Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
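The accounting problem in this commit can be modelled without touching a real cgroup. The following is a rough simulation, assuming a pids limit of 2 and a made-up grace period of 3 loop iterations; PidsCgroup and fork_loop are hypothetical names for illustration and are not kernel interfaces.

```python
from collections import deque


class PidsCgroup:
    """Toy pids counter with a hard limit."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def try_charge(self):
        if self.count + 1 > self.limit:
            return False
        self.count += 1
        return True

    def uncharge(self):
        self.count -= 1


def fork_loop(iterations, limit, uncharge_at_release, grace_period=3):
    """Return how many forks succeed before the limit is hit."""
    cg = PidsCgroup(limit)
    cg.try_charge()            # the forking parent occupies one slot
    pending = deque()          # ticks at which deferred uncharges become due
    succeeded = 0
    for tick in range(iterations):
        # Run deferred uncharges whose (simulated) grace period has elapsed.
        while pending and pending[0] <= tick:
            pending.popleft()
            cg.uncharge()
        if not cg.try_charge():
            break              # fork() would fail here
        succeeded += 1
        # The child exits at once and the parent reaps it.
        if uncharge_at_release:
            cg.uncharge()                        # uncharge when the pid is freed
        else:
            pending.append(tick + grace_period)  # uncharge only after the delay
    return succeeded
```

With the deferred variant the second fork already fails, because the first child's charge is still held; uncharging at release lets the loop run indefinitely.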
|
||
|
|
9f79143ebb |
UPSTREAM: kernel: cgroup: add poll file operation
Cgroup has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default. This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations. Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> (cherry picked from commit: dc50537bdd1a0804fa2cbc990565ee9a944e66fa) Bug: 127712811 Test: lmkd in PSI mode Change-Id: Idc648e7b7b7bd5fc00c7b32163e55a93b0f49a98 Signed-off-by: Suren Baghdasaryan <surenb@google.com> |
||
|
|
dc9cd29ded |
UPSTREAM: psi: cgroup support
On a system that executes multiple cgrouped jobs and independent workloads, we don't just care about the health of the overall system, but also that of individual jobs, so that we can ensure individual job health, fairness between jobs, or prioritize some jobs over others. This patch implements pressure stall tracking for cgroups. In kernels with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure, and io.pressure files that track aggregate pressure stall times for only the tasks inside the cgroup. Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 2ce7135adc9ad081aa3c49744144376ac74fea60) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec Signed-off-by: Suren Baghdasaryan <surenb@google.com> |
||
|
|
479adb89a9 |
cgroup: Fix dom_cgrp propagation when enabling threaded mode
A cgroup which is already a threaded domain may be converted into a
threaded cgroup if the prerequisite conditions are met. When this
happens, all threaded descendants should also have their ->dom_cgrp
updated to the new threaded domain cgroup. Unfortunately, this
propagation was missing, leading to the following failure.
# cd /sys/fs/cgroup/unified
# cat cgroup.subtree_control # show that no controllers are enabled
# mkdir -p mycgrp/a/b/c
# echo threaded > mycgrp/a/b/cgroup.type
At this point, the hierarchy looks as follows:
mycgrp [d]
    a [dt]
        b [t]
            c [inv]
Now let's make node "a" threaded (and thus "mycgrp" is made "domain threaded"):
# echo threaded > mycgrp/a/cgroup.type
By this point, we now have a hierarchy that looks as follows:
mycgrp [dt]
    a [t]
        b [t]
            c [inv]
But, when we try to convert the node "c" from "domain invalid" to
"threaded", we get ENOTSUP on the write():
# echo threaded > mycgrp/a/b/c/cgroup.type
sh: echo: write error: Operation not supported
This patch fixes the problem by
* Moving the opencoded ->dom_cgrp save and restoration in
cgroup_enable_threaded() into cgroup_{save|restore}_control() so
that multiple cgroups can be handled.
* Updating all threaded descendants' ->dom_cgrp to point to the new
dom_cgrp when enabling threaded mode.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Reported-by: Amin Jamali <ajamali@pivotal.io>
Reported-by: Joao De Almeida Pereira <jpereira@pivotal.io>
Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
Fixes:
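The missing propagation can be reproduced on a toy tree. This is a simplified model, assuming that a cgroup counts as threaded when its dom_cgrp points at another cgroup; the Cgroup class, enable_threaded() and build_example() are hypothetical helpers for illustration, not the kernel implementation.

```python
class Cgroup:
    """Toy cgroup node; a domain cgroup is its own dom_cgrp."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.dom_cgrp = self
        if parent is not None:
            parent.children.append(self)

    def is_threaded(self):
        return self.dom_cgrp is not self

    def descendants(self):
        # Pre-order walk over everything below this node.
        for child in self.children:
            yield child
            yield from child.descendants()


def enable_threaded(cgrp, propagate=True):
    """Make cgrp threaded under its parent's domain."""
    new_dom = cgrp.parent.dom_cgrp
    cgrp.dom_cgrp = new_dom
    if propagate:
        # The step the fix adds: threaded descendants follow the new domain.
        for desc in cgrp.descendants():
            if desc.is_threaded():
                desc.dom_cgrp = new_dom


def build_example():
    # Same layout as in the commit message: mycgrp/a/b/c.
    root = Cgroup("mycgrp")
    a = Cgroup("a", root)
    b = Cgroup("b", a)
    c = Cgroup("c", b)
    return root, a, b, c
```

Calling enable_threaded(b) and then enable_threaded(a, propagate=False) leaves b pointing at "a" as its domain even though "a" is no longer a domain, which is the stale state the commit describes; with propagation, b ends up pointing at "mycgrp".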
|
||
|
|
d09d8df3a2 |
blkcg: add generic throttling mechanism
Since IO can be issued from literally anywhere it's almost impossible to do throttling without having some sort of adverse effect somewhere else in the system because of locking or other dependencies. The best way to solve this is to do the throttling when we know we aren't holding any other kernel resources. Do this by tracking throttling in a per-blkg basis, and if we require throttling flag the task that it needs to check before it returns to user space and possibly sleep there. This is to address the case where a process is doing work that is generating IO that can't be throttled, whether that is directly with a lot of REQ_META IO, or indirectly by allocating so much memory that it is swamping the disk with REQ_SWAP. We can't use task_add_work as we don't want to induce a memory allocation in the IO path, so simply saving the request queue in the task and flagging it to do the notify_resume thing achieves the same result without the overhead of a memory allocation. Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
8f53470bab |
cgroup: Add cgroup_subsys->css_rstat_flush()
This patch adds cgroup_subsys->css_rstat_flush(). If a subsystem has this callback, its csses are linked on cgrp->css_rstat_list and rstat will call the function whenever the associated cgroup is flushed. Flush is also performed when such csses are released so that residual counts aren't lost. Combined with the rstat API previous patches factored out, this allows controllers to plug into rstat to manage their statistics in a scalable way. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
d4ff749b5e |
cgroup: Distinguish base resource stat implementation from rstat
Base resource stat accounts universal (not specific to any controller) resource consumption on top of rstat. Currently, its implementation is intermixed with the rstat implementation, making the code confusing to follow. This patch clarifies the distinction by doing the following.
* Encapsulate base resource stat counters, currently only cputime, in struct cgroup_base_stat.
* Move prev_cputime into struct cgroup and initialize it with cgroup.
* Rename the related functions so that they start with cgroup_base_stat.
* Prefix the related variables and field names with b.
This patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
c58632b363 |
cgroup: Rename stat to rstat
stat is too generic a name and ends up causing subtle confusions. It'll be made generic so that controllers can plug into it, which will make the problem worse. Let's rename it to something more specific - cgroup_rstat for cgroup recursive stat. This patch does the following renames. No other changes. * cpu_stat -> rstat_cpu * stat -> rstat * ?cstat -> ?rstatc Note that the renames are selective. The unrenamed are the ones which implement basic resource statistics on top of rstat. This will be further cleaned up in the following patches. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
b12e358328 |
cgroup: Limit event generation frequency
".events" files generate a file modified event to notify userland of possible new events. Some of the events can be quite bursty (e.g. memory high event) and generating a notification each time is costly and pointless. This patch implements an event rate limit mechanism. If a new notification is requested before 10ms has passed since the previous notification, the new notification is delayed till then. As this only delays from the second notification on in a given close cluster of notifications, userland reactions to notifications shouldn't be delayed at all in most cases while avoiding notification storms. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
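The rate limit described in the "Limit event generation frequency" commit can be sketched as a function from request times to delivery times. This is a minimal model, assuming millisecond timestamps and the 10ms interval from the commit message; deliveries() is a hypothetical helper and not the kernel's timer-based implementation.

```python
def deliveries(request_times_ms, min_interval_ms=10):
    """Map notification request times to the times they would be delivered."""
    delivered = []
    last = None       # time of the most recent delivery
    pending = None    # time of a delayed delivery that has not fired yet
    for t in sorted(request_times_ms):
        # Fire a delayed notification that became due before this request.
        if pending is not None and pending <= t:
            delivered.append(pending)
            last = pending
            pending = None
        if last is None or t - last >= min_interval_ms:
            delivered.append(t)            # far enough apart: deliver at once
            last = t
        elif pending is None:
            pending = last + min_interval_ms   # too soon: delay and coalesce
    if pending is not None:
        delivered.append(pending)
    return delivered
```

For a burst such as requests at 0, 1, 2 and 3 ms followed by one at 25 ms, the first request is delivered immediately, the rest of the burst collapses into a single delayed notification, and the late request goes out undelayed.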
|
|
d92cd810e6 |
Merge branch 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo: "rcu_work addition and a couple trivial changes" * 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: remove the comment about the old manager_arb mutex workqueue: fix the comments of nr_idle fs/aio: Use rcu_work instead of explicit rcu and work item cgroup: Use rcu_work instead of explicit rcu and work item RCU, workqueue: Implement rcu_work |
||
|
|
8f36aaec9c |
cgroup: Use rcu_work instead of explicit rcu and work item
Workqueue now has rcu_work. Use it instead of open-coding rcu -> work item bouncing. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
4dcb31d464 |
net: use skb_to_full_sk() in skb_update_prio()
Andrei Vagin reported a KASAN: slab-out-of-bounds error in
skb_update_prio()
Since SYNACK might be attached to a request socket, we need to
get back to the listener socket.
Since this listener is manipulated without locks, add const
qualifiers to sock_cgroup_prioidx() so that the const can also
be used in skb_update_prio()
Also add the const qualifier to sock_cgroup_classid() for consistency.
Fixes:
|
||
|
|
392536b731 |
cgroup: Update documentation reference
The cgroup_subsys structure references a documentation file that has been renamed after the v1/v2 split. Since the v2 documentation doesn't currently contain any information on kernel interfaces for controllers, point the user to the v1 docs. Cc: Tejun Heo <tj@kernel.org> Cc: linux-doc@vger.kernel.org Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
22714a2ba4 |
Merge branch 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"Cgroup2 cpu controller support is finally merged.
- Basic cpu statistics support to allow monitoring by default without
the CPU controller enabled.
- cgroup2 cpu controller support.
- /sys/kernel/cgroup files to help dealing with new / optional
features"
* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: export list of cgroups v2 features using sysfs
cgroup: export list of delegatable control files using sysfs
cgroup: mark @cgrp __maybe_unused in cpu_stat_show()
MAINTAINERS: relocate cpuset.c
cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
sched: Implement interface for cgroup unified hierarchy
sched: Misc preps for cgroup unified hierarchy interface
sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
cgroup: statically initialize init_css_set->dfl_cgrp
cgroup: Implement cgroup2 basic CPU usage accounting
cpuacct: Introduce cgroup_account_cputime[_field]()
sched/cputime: Expose cputime_adjust()
|
||
|
|
b24413180f |
License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information in it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. 
- Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). 
- when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. 
Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with:
- a full scancode scan run, collecting the matched texts, detected license ids and scores
- reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
d41bf8c9de |
cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
The basic cpu stat is currently shown with "cpu." prefix in cgroup.stat, and the same information is duplicated in cpu.stat when cpu controller is enabled. This is ugly and not very scalable as we want to expand the coverage of stat information which is always available.
This patch makes cgroup core always create "cpu.stat" file and show the basic cpu stat there and calls the cpu controller to show the extra stats when enabled. This ensures that the same information isn't presented in multiple places and makes future expansion of basic stats easier.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
||
|
|
041cd640b2 |
cgroup: Implement cgroup2 basic CPU usage accounting
In cgroup1, while cpuacct isn't actually controlling any resources, it
is a separate controller due to combination of two factors -
1. enabling cpu controller has significant side effects, and 2. we
have to pick one of the hierarchies to account CPU usages on. cpuacct
controller is effectively used to designate a hierarchy to track CPU
usages on.
cgroup2's unified hierarchy removes the second reason and we can
account basic CPU usages by default. While we can use cpuacct for
this purpose, both its interface and implementation leave a lot to be
desired - it collects and exposes two sources of truth which don't
agree with each other and some of the exposed statistics don't make
much sense. Also, it propagates all the way up the hierarchy on each
accounting event which is unnecessary.
This patch adds basic resource accounting mechanism to cgroup2's
unified hierarchy and accounts CPU usages using it.
* All accountings are done per-cpu and don't propagate immediately.
It just bumps the per-cgroup per-cpu counters and links to the
parent's updated list if not already on it.
* On a read, the per-cpu counters are collected into the global ones
and then propagated upwards. Only the per-cpu counters which have
changed since the last read are propagated.
* CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
prefix. Total usage is collected from scheduling events. User/sys
breakdown is sourced from tick sampling and adjusted to the usage
using cputime_adjust().
This keeps the accounting side hot path O(1) and per-cpu and the read
side O(nr_updated_since_last_read).
v2: Minor changes and documentation updates as suggested by Waiman and
Roman.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
|
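The per-cpu accounting scheme described in this commit message can be illustrated with a toy model. The following is a minimal Python sketch, not the kernel implementation: the class, field, and function names (`Cgroup`, `charge`, `read_usage`, `pending`, and so on) and the two-CPU setup are assumptions made for illustration. It shows a charge that only bumps a per-cpu counter and links the cgroup onto its parent's updated list, and a read that folds in only the cgroups touched since the last read.

```python
NR_CPUS = 2  # assumed CPU count for this toy model


class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.percpu = [0] * NR_CPUS        # per-cpu usage not yet folded in
        self.linked = [False] * NR_CPUS    # already linked for this CPU?
        self.updated_children = [set() for _ in range(NR_CPUS)]
        self.pending = 0                   # deltas pushed up by flushed children
        self.total = 0                     # global counter for the subtree


def charge(cgrp, cpu, delta):
    """Accounting side: bump the per-cpu counter, link upward if needed."""
    cgrp.percpu[cpu] += delta
    node = cgrp
    # Stop as soon as a node is already linked; its ancestors are too.
    while node is not None and not node.linked[cpu]:
        node.linked[cpu] = True
        if node.parent is not None:
            node.parent.updated_children[cpu].add(node)
        node = node.parent


def _flush_cpu(node, cpu):
    """Fold one CPU's deltas for node's updated subtree, children first."""
    for child in list(node.updated_children[cpu]):
        _flush_cpu(child, cpu)
    if node.parent is not None:
        node.parent.updated_children[cpu].discard(node)
    node.linked[cpu] = False
    delta = node.percpu[cpu] + node.pending
    node.percpu[cpu] = 0
    node.pending = 0
    node.total += delta
    if node.parent is not None:
        node.parent.pending += delta       # propagated on the parent's next read


def read_usage(cgrp):
    """Read side: visits only cgroups updated since the last read."""
    for cpu in range(NR_CPUS):
        _flush_cpu(cgrp, cpu)
    return cgrp.total


root = Cgroup("root")
a = Cgroup("a", root)
b = Cgroup("b", a)
charge(b, 0, 5)
charge(b, 1, 3)
charge(a, 0, 2)
```

Until `read_usage()` is called, every `total` stays at zero. Reading `a` folds in its own and `b`'s per-cpu counters and leaves the sum parked in `root.pending` for a later read of `root`.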
||
|
|
e1cba4b85d |
cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup
A new mount option "cpuset_v2_mode" is added to the v1 cgroupfs filesystem to enable cpuset controller to use v2 behavior in a v1 cgroup. This mount option applies only to cpuset controller and has no effect on other controllers.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
||
|
|
1a926e0bba |
cgroup: implement hierarchy limits
Creating cgroup hierarchies of unreasonable size can affect overall system performance. A user might want to limit the size of cgroup hierarchy. This is especially important if a user is delegating some cgroup sub-tree.
To address this issue, introduce an ability to control the size of cgroup hierarchy.
The cgroup.max.descendants control file sets the maximum allowed number of descendant cgroups. The cgroup.max.depth file controls the maximum depth of the cgroup tree. Both are single value r/w files, with "max" default value.
The control files exist on each hierarchy level (including root). When a new cgroup is created, we check the total descendants and depth limits on each level, and if none of them are exceeded, a new cgroup is created.
Only alive cgroups are counted; removed (dying) cgroups are ignored.
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
|
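The per-level limit check described above can be sketched as a walk from the prospective parent up to the root. This is a toy Python model, not kernel code: `Cgroup`, `check_limits`, `create_child`, `remove`, and the `LimitExceeded` exception are hypothetical names, and the fields only mirror the cgroup.max.descendants and cgroup.max.depth files by analogy.

```python
import math


class LimitExceeded(Exception):
    """Hypothetical error type used only by this sketch."""


class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.max_descendants = math.inf  # models the "max" default
        self.max_depth = math.inf        # models the "max" default
        self.nr_descendants = 0          # live descendants only


def check_limits(parent):
    """Check the descendant and depth limits on every level above the new cgroup."""
    depth = 1  # the new cgroup would sit one level below `parent`
    node = parent
    while node is not None:
        if node.nr_descendants >= node.max_descendants:
            return False
        if depth > node.max_depth:
            return False
        depth += 1
        node = node.parent
    return True


def create_child(parent):
    if not check_limits(parent):
        raise LimitExceeded
    child = Cgroup(parent)
    node = parent
    while node is not None:   # every ancestor gains one live descendant
        node.nr_descendants += 1
        node = node.parent
    return child


def remove(cgrp):
    """A removed (dying) cgroup stops counting against its ancestors."""
    node = cgrp.parent
    while node is not None:
        node.nr_descendants -= 1
        node = node.parent
```

With `max_depth = 2` set on the root, two nested levels can be created below it and a third is refused, because the depth counted from the root would be 3.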
||
|
|
0679dee03c |
cgroup: keep track of number of descent cgroups
Keep track of the number of online and dying descendant cgroups.
This data will be used later to add an ability to control cgroup hierarchy (limit the depth and the number of descendant cgroups) and display hierarchy stats.
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
|
||
|
|
8cfd8147df |
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
|
||
|
|
454000adaa |
cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
cgroup v2 is in the process of growing thread granularity support. A
threaded subtree is composed of a thread root and threaded cgroups
which are proper members of the subtree.
The root cgroup of the subtree serves as the domain cgroup to which
the processes (as opposed to threads / tasks) of the subtree
conceptually belong and domain-level resource consumptions not tied to
any specific task are charged. Inside the subtree, threads won't be
subject to process granularity or no-internal-task constraint and can
be distributed arbitrarily across the subtree.
This patch introduces cgroup->dom_cgrp along with threaded css_set
handling.
* cgroup->dom_cgrp points to self for normal and thread roots. For
proper thread subtree members, points to the dom_cgrp (the thread
root).
* css_set->dom_cset points to self for normal and thread roots. If
threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
The dom_cgrp serves as the resource domain and keeps the matching
csses available. The dom_cset holds those csses and makes them
easily accessible.
* All threaded csets are linked on their dom_csets to enable iteration
of all threaded tasks.
* cgroup->nr_threaded_children keeps track of the number of threaded
children.
This patch adds the above but doesn't actually use them yet. The
following patches will build on top.
v4: ->nr_threaded_children added.
v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset. Updated for the new
enable-threaded-per-cgroup behavior.
v2: Added cgroup_is_threaded() helper.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
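The `dom_cgrp` rule described in this commit message (self for normal cgroups and thread roots, the thread root for proper members of a threaded subtree) can be illustrated with a toy model. This is a Python sketch under assumptions: `Cgroup`, `make_threaded`, and `is_threaded` are illustrative names, and the real code also handles css_sets, locking, and validity checks that are omitted here.

```python
class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.dom_cgrp = self              # normal cgroups and thread roots point to self
        self.nr_threaded_children = 0


def is_threaded(cgrp):
    """A proper threaded-subtree member has a dom_cgrp other than itself."""
    return cgrp.dom_cgrp is not cgrp


def make_threaded(cgrp):
    """Make `cgrp` a proper member of its parent's threaded subtree."""
    parent = cgrp.parent
    if parent is None:
        raise ValueError("the root cgroup cannot become threaded")
    if is_threaded(cgrp):
        return
    # If the parent is a domain, it becomes the thread root; otherwise the
    # parent's own thread root is inherited.
    cgrp.dom_cgrp = parent.dom_cgrp
    parent.nr_threaded_children += 1


root = Cgroup("root")
a = Cgroup("a", root)
b = Cgroup("b", a)
c = Cgroup("c", b)
make_threaded(b)
make_threaded(c)
```

After the two calls, both `b` and `c` resolve to `a` as their domain, while `a` and `root` still point to themselves.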
||
|
|
788b950c62 |
cgroup: distinguish local and children populated states
cgrp->populated_cnt counts both local (the cgroup's populated css_sets) and subtree proper (populated children) so that it's only zero when the whole subtree, including self, is empty.
This patch splits the counter into two so that local and children populated states are tracked separately. It allows finer-grained tests on the state of the hierarchy which will be used to replace css_set walking local populated test.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
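The split between local and children populated states can be modelled with two counters per cgroup. This is a minimal Python sketch, assuming hypothetical names (`nr_populated_csets`, `nr_populated_children`, `update_populated`); it is meant to show the propagation idea, not the kernel's data structures.

```python
class Cgroup:
    def __init__(self, parent=None):
        self.parent = parent
        self.nr_populated_csets = 0      # local: populated css_sets of this cgroup
        self.nr_populated_children = 0   # subtree proper: populated children

    def is_populated(self):
        return self.nr_populated_csets + self.nr_populated_children > 0


def update_populated(cgrp, populated):
    """Record that one css_set of `cgrp` became populated or empty."""
    adj = 1 if populated else -1
    child = None
    node = cgrp
    while node is not None:
        was_populated = node.is_populated()
        if child is None:
            node.nr_populated_csets += adj       # the cgroup itself
        else:
            node.nr_populated_children += adj    # an ancestor of it
        if was_populated == node.is_populated():
            break                                # state unchanged, stop propagating
        child = node
        node = node.parent
```

A parent with `nr_populated_csets == 0` but `nr_populated_children > 0` is populated only through its subtree, which is the distinction a single combined counter cannot express.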
||
|
|
5136f6365c |
cgroup: implement "nsdelegate" mount option
Currently, cgroup only supports delegation to !root users and cgroup namespaces don't get any special treatments. This limits the usefulness of cgroup namespaces as they by themselves can't be safe delegation boundaries. A process inside a cgroup can change the resource control knobs of the parent in the namespace root and may move processes in and out of the namespace if cgroups outside its namespace are visible somehow.
This patch adds a new mount option "nsdelegate" which makes cgroup namespaces delegation boundaries. If set, cgroup behaves as if write permission based delegation took place at namespace boundaries - writes to the resource control knobs from the namespace root are denied and migration crossing the namespace boundary aren't allowed from inside the namespace.
This allows cgroup namespace to function as a delegation boundary by itself.
v2: Silently ignore nsdelegate specified on !init mounts.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aravind Anbudurai <aru7@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>
|
||
|
|
73a7242a06 |
cgroup: Keep accurate count of tasks in each css_set
The reference count in the css_set data structure was used as a
proxy of the number of tasks attached to that css_set. However, that
count is actually not an accurate measure especially with thread mode
support. So a new variable nr_tasks is added to the css_set to keep
track of the actual task count. This new variable is protected by
the css_set_lock. Functions that require the actual task count are
updated to use the new variable.
tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
Refreshed on top of cgroup/for-v4.13 which dropped on
css_set_populated() -> nr_tasks conversion.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
||
|
|
33c35aa481 |
cgroup: Prevent kill_css() from being called more than once
The kill_css() function may be called more than once under the condition that the css was killed but not physically removed yet followed by the removal of the cgroup that is hosting the css. This patch prevents any harm from being done when that happens.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.5+
|
||
|
|
b8b1a2e5ec |
cgroup: move cgroup_subsys_state parent field for cache locality
Various structures embed a struct cgroup_subsys_state, typically at the top of the containing structure. It is common for code that accesses the structures to perform operations that iterate over the chain of parent css pointers, also accessing data in each containing structure. In particular, struct cpuacct is used by fairly hot code paths in the scheduler such as cpuacct_charge().
Move the parent css pointer field to the end of the structure to increase the chances of residing in the same cache line as the data from the containing structure.
Signed-off-by: Todd Poynor <toddpoynor@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
||
|
|
4b9502e63b |
kernel: convert css_set.refcount from atomic_t to refcount_t
The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This avoids accidental refcounter overflows that might lead to use-after-free situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
||
|
|
780de9dd27 |
sched/headers, cgroups: Remove the threadgroup_change_*() wrappery
threadgroup_change_begin()/end() is a pointless wrapper around cgroup_threadgroup_change_begin()/end(), minus a might_sleep() in the !CONFIG_CGROUPS=y case.
Remove the wrappery, move the might_sleep() (the down_read() already has a might_sleep() check).
This debloats <linux/sched.h> a bit and simplifies this API.
Update all call sites.
No change in functionality.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
5f617ebbdf |
cgroup: reorder css_set fields
Reorder css_set fields so that they're roughly in the order of how hot they are. The rough order is
1. the actual csses
2. reference counter and the default cgroup pointer.
3. task lists and iterations
4. fields used during merge including css_set lookup
5. the rest
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
|
||
|
|
e90cbebc3f |
cgroup add cftype->open/release() callbacks
Pipe the newly added kernfs->open/release() callbacks through cftype. While at it, as cleanup operations now can be performed from ->release() instead of ->seq_stop(), make the latter optional.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
|
||
|
|
3007098494 |
cgroup: add support for eBPF programs
This patch adds two sets of eBPF program pointers to struct cgroup.
One set holds the programs that are directly pinned to a cgroup, and
the other holds those that are effective for it.
To illustrate the logic behind that, assume the following example
cgroup hierarchy.
A - B - C
      \ D - E
If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.
Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.
Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
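The inheritance rule in the A/B/C/D/E example above amounts to "the nearest cgroup on the path to the root with a program of that type wins". The following Python sketch is a toy model under assumptions: `Cgroup`, `attach`, and `effective_prog` are illustrative names, programs are plain strings, and the effective program is computed on lookup, whereas the commit describes storing a second set of effective pointers in struct cgroup.

```python
class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.attached = {}   # program type -> program pinned directly here


def attach(cgrp, prog_type, prog):
    """Pin a program to a cgroup; only one program per type."""
    cgrp.attached[prog_type] = prog


def effective_prog(cgrp, prog_type):
    """Return the nearest program of this type on the path to the root."""
    node = cgrp
    while node is not None:
        if prog_type in node.attached:
            return node.attached[prog_type]
        node = node.parent
    return None


# The example hierarchy from the commit message: C and D are children of B.
A = Cgroup("A")
B = Cgroup("B", A)
C = Cgroup("C", B)
D = Cgroup("D", B)
E = Cgroup("E", D)

attach(B, "ingress", "prog_b")
attach(D, "ingress", "prog_d")
```

With both attachments in place, `prog_b` is effective for B and C, `prog_d` for D and E, and A has no effective program.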
||
|
|
5cf1cacb49 |
cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback
Since
|
||
|
|
2b021cbf3c |
cgroup: ignore css_sets associated with dead cgroups during migration
Before |
||
|
|
f6d635ad34 |
cgroup: implement cgroup_subsys->implicit_on_dfl
Some controllers, perf_event for now and possibly freezer in the
future, don't really make sense to control explicitly through
"cgroup.subtree_control". For example, the primary role of perf_event
is identifying the cgroups of tasks; however, because the controller
also keeps a small amount of state per cgroup, it can't be replaced
with simple cgroup membership tests.
This patch implements cgroup_subsys->implicit_on_dfl flag. When set,
the controller is implicitly enabled on all cgroups on the v2
hierarchy so that utility type controllers such as perf_event can be
enabled and function transparently.
An implicit controller doesn't show up in "cgroup.controllers" or
"cgroup.subtree_control", is exempt from no internal process rule and
can be stolen from the default hierarchy even if there are non-root
csses.
v2: Reimplemented on top of the recent updates to css handling and
subsystem rebinding. Rebinding implicit subsystems is now a
simple matter of exempting it from the busy subsystem check.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
||
|
|
e4857982f4 |
cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
Migration can be multi-target on the default hierarchy when a controller is enabled - processes belonging to each child cgroup have to be moved to the child cgroup itself to refresh css association.
This isn't a problem for cgroup_migrate_add_src() as each source css_set still maps to single source and target cgroups; however, cgroup_migrate_prepare_dst() is called once after all source css_sets are added and thus might not have a single destination cgroup. This is currently worked around by specifying NULL for @dst_cgrp and using the source's default cgroup as destination as the only multi-target migration in use is self-targeting. While this works, it's subtle and clunky.
As all target cgroups are already specified while preparing the source css_sets, this clunkiness can easily be removed by recording the target cgroup in each source css_set. This patch adds css_set->mg_dst_cgrp which is recorded on cgroup_migrate_add_src() and used by cgroup_migrate_prepare_dst().
This also makes migration code ready for arbitrary multi-target migration.
Signed-off-by: Tejun Heo <tj@kernel.org>
|