ccea9767d3cc25f1540b4c2474238fbf32809666
389 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
6369256a83 |
Merge 4.19.269 into android-4.19-stable
Changes in 4.19.269 arm: dts: rockchip: fix node name for hym8563 rtc ARM: dts: rockchip: fix ir-receiver node names ARM: 9251/1: perf: Fix stacktraces for tracepoint events in THUMB2 kernels ARM: 9266/1: mm: fix no-MMU ZERO_PAGE() implementation ARM: dts: rockchip: disable arm_global_timer on rk3066 and rk3188 9p/fd: Use P9_HDRSZ for header size ALSA: seq: Fix function prototype mismatch in snd_seq_expand_var_event ASoC: soc-pcm: Add NULL check in BE reparenting regulator: twl6030: fix get status of twl6032 regulators fbcon: Use kzalloc() in fbcon_prepare_logo() 9p/xen: check logical size for buffer size net: usb: qmi_wwan: add u-blox 0x1342 composition xen/netback: Ensure protocol headers don't fall in the non-linear area xen/netback: do some code cleanup xen/netback: don't call kfree_skb() with interrupts disabled rcutorture: Automatically create initrd directory media: v4l2-dv-timings.c: fix too strict blanking sanity checks memcg: fix possible use-after-free in memcg_write_event_control() KVM: s390: vsie: Fix the initialization of the epoch extension (epdx) field HID: hid-lg4ff: Add check for empty lbuf HID: core: fix shift-out-of-bounds in hid_report_raw_event ieee802154: cc2520: Fix error return code in cc2520_hw_init() ca8210: Fix crash by zero initializing data gpio: amd8111: Fix PCI device reference count leak e1000e: Fix TX dispatch condition igb: Allocate MSI-X vector when testing Bluetooth: 6LoWPAN: add missing hci_dev_put() in get_l2cap_conn() Bluetooth: Fix not cleanup led when bt_init fails selftests: rtnetlink: correct xfrm policy rule in kci_test_ipsec_offload mac802154: fix missing INIT_LIST_HEAD in ieee802154_if_add() net: encx24j600: Add parentheses to fix precedence net: encx24j600: Fix invalid logic in reading of MISTAT register xen-netfront: Fix NULL sring after live migration net: mvneta: Prevent out of bounds read in mvneta_config_rss() i40e: Fix not setting default xps_cpus after reset i40e: Fix for VF MAC address 0 i40e: Disallow ip4 and ip6 l4_4_bytes NFC: nci: Bounds check struct nfc_target arrays nvme initialize core quirks before calling nvme_init_subsystem net: stmmac: fix "snps,axi-config" node property parsing net: hisilicon: Fix potential use-after-free in hisi_femac_rx() net: hisilicon: Fix potential use-after-free in hix5hd2_rx() tipc: Fix potential OOB in tipc_link_proto_rcv() ethernet: aeroflex: fix potential skb leak in greth_init_rings() xen/netback: fix build warning net: plip: don't call kfree_skb/dev_kfree_skb() under spin_lock_irq() ipv6: avoid use-after-free in ip6_fragment() net: mvneta: Fix an out of bounds check can: esd_usb: Allow REC and TEC to return to zero Linux 4.19.269 Change-Id: Ie79e55bad6376d2314d44479ef1ec3c546d24030 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
|
e1ae97624e |
memcg: fix possible use-after-free in memcg_write_event_control()
commit 4a7ba45b1a435e7097ca0f79a847d0949d0eb088 upstream. memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to |
||
|
|
92c6dd6a65 |
BACKPORT: cgroup: make per-cgroup pressure stall tracking configurable
PSI accounts stalls for each cgroup separately and aggregates it at each level of the hierarchy. This causes additional overhead with psi_avgs_work being called for each cgroup in the hierarchy. psi_avgs_work has been highly optimized, however on systems with large number of cgroups the overhead becomes noticeable. Systems which use PSI only at the system level could avoid this overhead if PSI can be configured to skip per-cgroup stall accounting. Add "cgroup_disable=pressure" kernel command-line option to allow requesting system-wide only pressure stall accounting. When set, it keeps system-wide accounting under /proc/pressure/ but skips accounting for individual cgroups and does not expose PSI nodes in cgroup hierarchy. Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/patchwork/patch/1435705 (cherry picked from commit 3958e2d0c34e18c41b60dc01832bd670a59ef70f https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git tj) Conflicts: include/linux/cgroup-defs.h kernel/cgroup/cgroup.c 1. Trivial merge conflict in cgroup-defs.h due to missing CFTYPE_DEBUG 2. Changed flags to (CFTYPE_NOT_ON_ROOT | CFTYPE_PRESSURE) in cgroup.c because in 4.19 psi files were allowed only in non-root cgroups. Bug: 178872719 Bug: 191734423 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: Ifc8fbc52f9a1131d7c2668edbb44c525c76c3360 |
||
|
|
5318f3163c |
BACKPORT: cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
Change-Id: I3404119678cbcd7410aa56e9334055cee79d02fa
(cherry picked from commit 76f969e8948d82e78e1bc4beb6b9465908e74873)
cgroup-defs.h: use the struct cgroup_freezer_state and the
freezer field from definitions in I6221a975c04f06249a4f8d693852776ae08a8d8e
sched.h: use the frozen field defined in
I6221a975c04f06249a4f8d693852776ae08a8d8e
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
|
||
|
|
0e00bb1cf7 |
UPSTREAM: cgroup: Simplify cgroup_ancestor
Simplify cgroup_ancestor function. This is follow-up for
commit
|
||
|
|
b41585fc93 |
Merge 4.19.134 into android-4.19-stable
Changes in 4.19.134
perf: Make perf able to build with latest libbfd
net: rmnet: fix lower interface leak
genetlink: remove genl_bind
ipv4: fill fl4_icmp_{type,code} in ping_v4_sendmsg
l2tp: remove skb_dst_set() from l2tp_xmit_skb()
llc: make sure applications use ARPHRD_ETHER
net: Added pointer check for dst->ops->neigh_lookup in dst_neigh_lookup_skb
net_sched: fix a memory leak in atm_tc_init()
net: usb: qmi_wwan: add support for Quectel EG95 LTE modem
tcp: fix SO_RCVLOWAT possible hangs under high mem pressure
tcp: make sure listeners don't initialize congestion-control state
tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()
tcp: md5: do not send silly options in SYNCOOKIES
tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers
tcp: md5: allow changing MD5 keys in all socket states
cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
cgroup: Fix sock_cgroup_data on big-endian.
sched: consistently handle layer3 header accesses in the presence of VLANs
vlan: consolidate VLAN parsing code and limit max parsing depth
drm/msm: fix potential memleak in error branch
drm/exynos: fix ref count leak in mic_pre_enable
m68k: nommu: register start of the memory with memblock
m68k: mm: fix node memblock init
arm64/alternatives: use subsections for replacement sequences
tpm_tis: extra chip->ops check on error path in tpm_tis_core_init
gfs2: read-only mounts should grab the sd_freeze_gl glock
i2c: eg20t: Load module automatically if ID matches
arm64/alternatives: don't patch up internal branches
iio:magnetometer:ak8974: Fix alignment and data leak issues
iio:humidity:hdc100x Fix alignment and data leak issues
iio: magnetometer: ak8974: Fix runtime PM imbalance on error
iio: mma8452: Add missed iio_device_unregister() call in mma8452_probe()
iio: pressure: zpa2326: handle pm_runtime_get_sync failure
iio:humidity:hts221 Fix alignment and data leak issues
iio:pressure:ms5611 Fix buffer element alignment
iio:health:afe4403 Fix timestamp alignment and prevent data leak.
spi: fix initial SPI_SR value in spi-fsl-dspi
spi: spi-fsl-dspi: Fix lockup if device is shutdown during SPI transfer
net: dsa: bcm_sf2: Fix node reference count
of: of_mdio: Correct loop scanning logic
Revert "usb/ohci-platform: Fix a warning when hibernating"
Revert "usb/xhci-plat: Set PM runtime as active on resume"
Revert "usb/ehci-platform: Set PM runtime as active on resume"
net: sfp: add support for module quirks
net: sfp: add some quirks for GPON modules
HID: quirks: Remove ITE 8595 entry from hid_have_special_driver
ARM: at91: pm: add quirk for sam9x60's ulp1
scsi: sr: remove references to BLK_DEV_SR_VENDOR, leave it enabled
ALSA: usb-audio: Create a registration quirk for Kingston HyperX Amp (0951:16d8)
doc: dt: bindings: usb: dwc3: Update entries for disabling SS instances in park mode
mmc: sdhci: do not enable card detect interrupt for gpio cd type
ALSA: usb-audio: Rewrite registration quirk handling
ACPI: video: Use native backlight on Acer Aspire 5783z
ALSA: usb-audio: Add registration quirk for Kingston HyperX Cloud Alpha S
Input: mms114 - add extra compatible for mms345l
ACPI: video: Use native backlight on Acer TravelMate 5735Z
ALSA: usb-audio: Add registration quirk for Kingston HyperX Cloud Flight S
iio:health:afe4404 Fix timestamp alignment and prevent data leak.
phy: sun4i-usb: fix dereference of pointer phy0 before it is null checked
arm64: dts: meson: add missing gxl rng clock
spi: spi-sun6i: sun6i_spi_transfer_one(): fix setting of clock rate
usb: gadget: udc: atmel: fix uninitialized read in debug printk
staging: comedi: verify array index is correct before using it
Revert "thermal: mediatek: fix register index error"
ARM: dts: socfpga: Align L2 cache-controller nodename with dtschema
regmap: debugfs: Don't sleep while atomic for fast_io regmaps
copy_xstate_to_kernel: Fix typo which caused GDB regression
apparmor: ensure that dfa state tables have entries
perf stat: Zero all the 'ena' and 'run' array slot stats for interval mode
soc: qcom: rpmh: Update dirty flag only when data changes
soc: qcom: rpmh: Invalidate SLEEP and WAKE TCSes before flushing new data
soc: qcom: rpmh-rsc: Clear active mode configuration for wake TCS
soc: qcom: rpmh-rsc: Allow using free WAKE TCS for active request
mtd: rawnand: marvell: Use nand_cleanup() when the device is not yet registered
mtd: rawnand: marvell: Fix probe error path
mtd: rawnand: timings: Fix default tR_max and tCCS_min timings
mtd: rawnand: brcmnand: fix CS0 layout
mtd: rawnand: oxnas: Keep track of registered devices
mtd: rawnand: oxnas: Unregister all devices on error
mtd: rawnand: oxnas: Release all devices in the _remove() path
slimbus: core: Fix mismatch in of_node_get/put
HID: magicmouse: do not set up autorepeat
HID: quirks: Always poll Obins Anne Pro 2 keyboard
HID: quirks: Ignore Simply Automated UPB PIM
ALSA: line6: Perform sanity check for each URB creation
ALSA: line6: Sync the pending work cancel at disconnection
ALSA: usb-audio: Fix race against the error recovery URB submission
ALSA: hda/realtek - change to suitable link model for ASUS platform
ALSA: hda/realtek - Enable Speaker for ASUS UX533 and UX534
USB: c67x00: fix use after free in c67x00_giveback_urb
usb: dwc2: Fix shutdown callback in platform
usb: chipidea: core: add wakeup support for extcon
usb: gadget: function: fix missing spinlock in f_uac1_legacy
USB: serial: iuu_phoenix: fix memory corruption
USB: serial: cypress_m8: enable Simply Automated UPB PIM
USB: serial: ch341: add new Product ID for CH340
USB: serial: option: add GosunCn GM500 series
USB: serial: option: add Quectel EG95 LTE modem
virt: vbox: Fix VBGL_IOCTL_VMMDEV_REQUEST_BIG and _LOG req numbers to match upstream
virt: vbox: Fix guest capabilities mask check
virtio: virtio_console: add missing MODULE_DEVICE_TABLE() for rproc serial
serial: mxs-auart: add missed iounmap() in probe failure and remove
ovl: inode reference leak in ovl_is_inuse true case.
ovl: relax WARN_ON() when decoding lower directory file handle
ovl: fix unneeded call to ovl_change_flags()
fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS
Revert "zram: convert remaining CLASS_ATTR() to CLASS_ATTR_RO()"
mei: bus: don't clean driver pointer
Input: i8042 - add Lenovo XiaoXin Air 12 to i8042 nomux list
uio_pdrv_genirq: fix use without device tree and no interrupt
timer: Prevent base->clk from moving backward
timer: Fix wheel index calculation on last level
MIPS: Fix build for LTS kernel caused by backporting lpj adjustment
riscv: use 16KB kernel stack on 64-bit
hwmon: (emc2103) fix unable to change fan pwm1_enable attribute
powerpc/book3s64/pkeys: Fix pkey_access_permitted() for execute disable pkey
intel_th: pci: Add Jasper Lake CPU support
intel_th: pci: Add Tiger Lake PCH-H support
intel_th: pci: Add Emmitsburg PCH support
intel_th: Fix a NULL dereference when hub driver is not loaded
dmaengine: fsl-edma: Fix NULL pointer exception in fsl_edma_tx_handler
misc: atmel-ssc: lock with mutex instead of spinlock
thermal/drivers/cpufreq_cooling: Fix wrong frequency converted from power
arm64: ptrace: Override SPSR.SS when single-stepping is enabled
arm64: ptrace: Consistently use pseudo-singlestep exceptions
arm64: compat: Ensure upper 32 bits of x0 are zero on syscall return
sched: Fix unreliable rseq cpu_id for new tasks
sched/fair: handle case of task_h_load() returning 0
genirq/affinity: Handle affinity setting on inactive interrupts correctly
printk: queue wake_up_klogd irq_work only if per-CPU areas are ready
libceph: don't omit recovery_deletes in target_copy()
rxrpc: Fix trace string
spi: sprd: switch the sequence of setting WDG_LOAD_LOW and _HIGH
Linux 4.19.134
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ieeb9e03f4a2d51aeebe3a3eadd9c1b93a26088a0
|
||
|
|
0505cc4c90 |
cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
[ Upstream commit ad0f75e5f57ccbceec13274e1e242f2b5a6397ed ] When we clone a socket in sk_clone_lock(), its sk_cgrp_data is copied, so the cgroup refcnt must be taken too. And, unlike the sk_alloc() path, sock_update_netprioidx() is not called here. Therefore, it is safe and necessary to grab the cgroup refcnt even when cgroup_sk_alloc is disabled. sk_clone_lock() is in BH context anyway, the in_interrupt() would terminate this function if called there. And for sk_alloc() skcd->val is always zero. So it's safe to factor out the code to make it more readable. The global variable 'cgroup_sk_alloc_disabled' is used to determine whether to take these reference counts. It is impossible to make the reference counting correct unless we save this bit of information in skcd->val. So, add a new bit there to record whether the socket has already taken the reference counts. This obviously relies on kmalloc() to align cgroup pointers to at least 4 bytes, ARCH_KMALLOC_MINALIGN is certainly larger than that. This bug seems to be introduced since the beginning, commit |
||
|
|
ab3e3b23d8 |
cgroup: Iterate tasks that did not finish do_exit()
commit 9c974c77246460fa6a92c18554c3311c8c83c160 upstream.
PF_EXITING is set earlier than actual removal from css_set when a task
is exitting. This can confuse cgroup.procs readers who see no PF_EXITING
tasks, however, rmdir is checking against css_set membership so it can
transitionally fail with EBUSY.
Fix this by listing tasks that weren't unlinked from css_set active
lists.
It may happen that other users of the task iterator (without
CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This
is equal to the state before commit c03cd7738a83 ("cgroup: Include dying
leaders with live threads in PROCS iterations") but it may be reviewed
later.
Reported-by: Suren Baghdasaryan <surenb@google.com>
Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
6f2a9a3536 |
UPSTREAM: cgroup: Iterate tasks that did not finish do_exit()
PF_EXITING is set earlier than actual removal from css_set when a task
is exitting. This can confuse cgroup.procs readers who see no PF_EXITING
tasks, however, rmdir is checking against css_set membership so it can
transitionally fail with EBUSY.
Fix this by listing tasks that weren't unlinked from css_set active
lists.
It may happen that other users of the task iterator (without
CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This
is equal to the state before commit c03cd7738a83 ("cgroup: Include dying
leaders with live threads in PROCS iterations") but it may be reviewed
later.
Reported-by: Suren Baghdasaryan <surenb@google.com>
Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
(cherry picked from commit 9c974c77246460fa6a92c18554c3311c8c83c160)
Bug: 141213848
Bug: 146758430
Test: test_cgcore_destroy from linux-kselftest
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Iac57661b931129ed1e44b89675f8115bb89084ff
(cherry picked from commit 21ee296526c70d6dc3c64639406f156f39b80fd0)
|
||
|
|
3703043afb |
UPSTREAM: cgroup: add cgroup_parse_float()
cgroup already uses floating point for percent[ile] numbers and there are several controllers which want to take them as input. Add a generic parse helper to handle inputs. Update the interface convention documentation about the use of percentage numbers. While at it, also clarify the default time unit. Bug: 120440300 Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit a5e112e6424adb77d953eac20e6936b952fd6b32) Signed-off-by: Qais Yousef <qais.yousef@arm.com> Change-Id: Ic1fcf21d7955eb8edd2e8e91517bca6aef41694f Signed-off-by: Quentin Perret <qperret@google.com> |
||
|
|
487e61785a |
Merge 4.19.66 into android-4.19
Changes in 4.19.66 scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure gcc-9: don't warn about uninitialized variable driver core: Establish order of operations for device_add and device_del via bitflag drivers/base: Introduce kill_device() libnvdimm/bus: Prevent duplicate device_unregister() calls libnvdimm/region: Register badblocks before namespaces libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock HID: wacom: fix bit shift for Cintiq Companion 2 HID: Add quirk for HP X1200 PIXART OEM mouse IB: directly cast the sockaddr union to aockaddr atm: iphase: Fix Spectre v1 vulnerability bnx2x: Disable multi-cos feature. ife: error out when nla attributes are empty ip6_gre: reload ipv6h in prepare_ip6gre_xmit_ipv6 ip6_tunnel: fix possible use-after-free on xmit ipip: validate header length in ipip_tunnel_xmit mlxsw: spectrum: Fix error path in mlxsw_sp_module_init() mvpp2: fix panic on module removal mvpp2: refactor MTU change code net: bridge: delete local fdb on device init failure net: bridge: mcast: don't delete permanent entries when fast leave is enabled net: fix ifindex collision during namespace removal net/mlx5e: always initialize frag->last_in_page net/mlx5: Use reversed order when unregister devices net: phylink: Fix flow control for fixed-link net: qualcomm: rmnet: Fix incorrect UL checksum offload logic net: sched: Fix a possible null-pointer dereference in dequeue_func() net sched: update vlan action for batched events operations net: sched: use temporary variable for actions indexes net/smc: do not schedule tx_work in SMC_CLOSED state NFC: nfcmrvl: fix gpio-handling regression ocelot: Cancel delayed work before wq destruction tipc: compat: allow tipc commands without arguments tun: mark small packets as owned by the tap sock net/mlx5: Fix modify_cq_in alignment net/mlx5e: Prevent encap flow counter update async to user query r8169: don't use MSI before RTL8168d compat_ioctl: pppoe: fix PPPOEIOCSFWD handling cgroup: Call cgroup_release() before __exit_signal() cgroup: Implement css_task_iter_skip() cgroup: Include dying leaders with live threads in PROCS iterations cgroup: css_task_iter_skip()'d iterators must be advanced before accessed cgroup: Fix css_task_iter_advance_css_set() cset skip condition spi: bcm2835: Fix 3-wire mode if DMA is enabled Linux 4.19.66 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Id33ce169af8bf14a3791040b4cf923832ce84f6c |
||
|
|
4340d175b8 |
cgroup: Include dying leaders with live threads in PROCS iterations
commit c03cd7738a83b13739f00546166969342c8ff014 upstream. CSS_TASK_ITER_PROCS currently iterates live group leaders; however, this means that a process with dying leader and live threads will be skipped. IOW, cgroup.procs might be empty while cgroup.threads isn't, which is confusing to say the least. Fix it by making cset track dying tasks and include dying leaders with live threads in PROCS iteration. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
370b9e6399 |
cgroup: Implement css_task_iter_skip()
commit b636fd38dc40113f853337a7d2a6885ad23b8811 upstream. When a task is moved out of a cset, task iterators pointing to the task are advanced using the normal css_task_iter_advance() call. This is fine but we'll be tracking dying tasks on csets and thus moving tasks from cset->tasks to (to be added) cset->dying_tasks. When we remove a task from cset->tasks, if we advance the iterators, they may move over to the next cset before we had the chance to add the task back on the dying list, which can allow the task to escape iteration. This patch separates out skipping from advancing. Skipping only moves the affected iterators to the next pointer rather than fully advancing it and the following advancing will recognize that the cursor has already been moved forward and do the rest of advancing. This ensures that when a task moves from one list to another in its cset, as long as it moves in the right direction, it's always visible to iteration. This doesn't cause any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
f613e8938d |
Merge 4.19.53 into android-4.19
Changes in 4.19.53 drm/nouveau: add kconfig option to turn off nouveau legacy contexts. (v3) nouveau: Fix build with CONFIG_NOUVEAU_LEGACY_CTX_SUPPORT disabled HID: multitouch: handle faulty Elo touch device HID: wacom: Don't set tool type until we're in range HID: wacom: Don't report anything prior to the tool entering range HID: wacom: Send BTN_TOUCH in response to INTUOSP2_BT eraser contact HID: wacom: Correct button numbering 2nd-gen Intuos Pro over Bluetooth HID: wacom: Sync INTUOSP2_BT touch state after each frame if necessary Revert "ALSA: hda/realtek - Improve the headset mic for Acer Aspire laptops" ALSA: oxfw: allow PCM capture for Stanton SCS.1m ALSA: hda/realtek - Update headset mode for ALC256 ALSA: firewire-motu: fix destruction of data for isochronous resources libata: Extend quirks for the ST1000LM024 drives with NOLPM quirk mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node fs/ocfs2: fix race in ocfs2_dentry_attach_lock() mm/vmscan.c: fix trying to reclaim unevictable LRU page signal/ptrace: Don't leak unitialized kernel memory with PTRACE_PEEK_SIGINFO ptrace: restore smp_rmb() in __ptrace_may_access() iommu/arm-smmu: Avoid constant zero in TLBI writes i2c: acorn: fix i2c warning bcache: fix stack corruption by PRECEDING_KEY() bcache: only set BCACHE_DEV_WB_RUNNING when cached device attached cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css() ASoC: cs42xx8: Add regcache mask dirty ASoC: fsl_asrc: Fix the issue about unsupported rate drm/i915/sdvo: Implement proper HDMI audio support for SDVO x86/uaccess, kcov: Disable stack protector ALSA: seq: Protect in-kernel ioctl calls with mutex ALSA: seq: Fix race of get-subscription call vs port-delete ioctls Revert "ALSA: seq: Protect in-kernel ioctl calls with mutex" s390/kasan: fix strncpy_from_user kasan checks Drivers: misc: fix out-of-bounds access in function param_set_kgdbts_var f2fs: fix to avoid accessing xattr across the boundary scsi: qedi: remove memset/memcpy to nfunc and use func instead scsi: qedi: remove set but not used variables 'cdev' and 'udev' scsi: lpfc: correct rcu unlock issue in lpfc_nvme_info_show scsi: lpfc: add check for loss of ndlp when sending RRQ arm64/mm: Inhibit huge-vmap with ptdump nvme: fix srcu locking on error return in nvme_get_ns_from_disk nvme: remove the ifdef around nvme_nvm_ioctl nvme: merge nvme_ns_ioctl into nvme_ioctl nvme: release namespace SRCU protection before performing controller ioctls nvme: fix memory leak for power latency tolerance platform/x86: pmc_atom: Add Lex 3I380D industrial PC to critclk_systems DMI table platform/x86: pmc_atom: Add several Beckhoff Automation boards to critclk_systems DMI table scsi: bnx2fc: fix incorrect cast to u64 on shift operation libnvdimm: Fix compilation warnings with W=1 selftests: fib_rule_tests: fix local IPv4 address typo selftests/timers: Add missing fflush(stdout) calls tracing: Prevent hist_field_var_ref() from accessing NULL tracing_map_elts usbnet: ipheth: fix racing condition KVM: arm/arm64: Move cc/it checks under hyp's Makefile to avoid instrumentation KVM: x86/pmu: mask the result of rdpmc according to the width of the counters KVM: x86/pmu: do not mask the value that is written to fixed PMUs KVM: s390: fix memory slot handling for KVM_SET_USER_MEMORY_REGION tools/kvm_stat: fix fields filter for child events drm/vmwgfx: integer underflow in vmw_cmd_dx_set_shader() leading to an invalid read drm/vmwgfx: NULL pointer dereference from vmw_cmd_dx_view_define() usb: dwc2: Fix DMA cache alignment issues usb: dwc2: host: Fix wMaxPacketSize handling (fix webcam regression) USB: Fix chipmunk-like voice when using Logitech C270 for recording audio. USB: usb-storage: Add new ID to ums-realtek USB: serial: pl2303: add Allied Telesis VT-Kit3 USB: serial: option: add support for Simcom SIM7500/SIM7600 RNDIS mode USB: serial: option: add Telit 0x1260 and 0x1261 compositions timekeeping: Repair ktime_get_coarse*() granularity RAS/CEC: Convert the timer callback to a workqueue RAS/CEC: Fix binary search function x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback x86/kasan: Fix boot with 5-level paging and KASAN x86/mm/KASLR: Compute the size of the vmemmap section properly x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled drm/edid: abstract override/firmware EDID retrieval drm: add fallback override/firmware EDID modes workaround rtc: pcf8523: don't return invalid date when battery is low Linux 4.19.53 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
|
c3b85bda41 |
cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
commit 18fa84a2db0e15b02baa5d94bdb5bd509175d2f6 upstream.
A PF_EXITING task can stay associated with an offline css. If such
task calls task_get_css(), it can get stuck indefinitely. This can be
triggered by BSD process accounting which writes to a file with
PF_EXITING set when racing against memcg disable as in the backtrace
at the end.
After this change, task_get_css() may return a css which was already
offline when the function was called. None of the existing users are
affected by this change.
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched detected stalls on CPUs/tasks:
...
NMI backtrace for cpu 0
...
Call Trace:
<IRQ>
dump_stack+0x46/0x68
nmi_cpu_backtrace.cold.2+0x13/0x57
nmi_trigger_cpumask_backtrace+0xba/0xca
rcu_dump_cpu_stacks+0x9e/0xce
rcu_check_callbacks.cold.74+0x2af/0x433
update_process_times+0x28/0x60
tick_sched_timer+0x34/0x70
__hrtimer_run_queues+0xee/0x250
hrtimer_interrupt+0xf4/0x210
smp_apic_timer_interrupt+0x56/0x110
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:balance_dirty_pages_ratelimited+0x28f/0x3d0
...
btrfs_file_write_iter+0x31b/0x563
__vfs_write+0xfa/0x140
__kernel_write+0x4f/0x100
do_acct_process+0x495/0x580
acct_process+0xb9/0xdb
do_exit+0x748/0xa00
do_group_exit+0x3a/0xa0
get_signal+0x254/0x560
do_signal+0x23/0x5c0
exit_to_usermode_loop+0x5d/0xa0
prepare_exit_to_usermode+0x53/0x80
retint_user+0x8/0x8
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.2+
Fixes:
|
||
|
|
d885da678e |
Merge 4.19.34 into android-4.19
Changes in 4.19.34 arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals ext4: cleanup bh release code in ext4_ind_remove_space() tty/serial: atmel: Add is_half_duplex helper tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped CIFS: fix POSIX lock leak and invalid ptr deref h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux- f2fs: fix to adapt small inline xattr space in __find_inline_xattr() f2fs: fix to avoid deadlock in f2fs_read_inline_dir() tracing: kdb: Fix ftdump to not sleep net/mlx5: Avoid panic when setting vport rate net/mlx5: Avoid panic when setting vport mac, getting vport config gpio: gpio-omap: fix level interrupt idling include/linux/relay.h: fix percpu annotation in struct rchan sysctl: handle overflow for file-max net: stmmac: Avoid sometimes uninitialized Clang warnings enic: fix build warning without CONFIG_CPUMASK_OFFSTACK libbpf: force fixdep compilation at the start of the build scsi: hisi_sas: Set PHY linkrate when disconnected scsi: hisi_sas: Fix a timeout race of driver internal and SMP IO iio: adc: fix warning in Qualcomm PM8xxx HK/XOADC driver x86/hyperv: Fix kernel panic when kexec on HyperV perf c2c: Fix c2c report for empty numa node mm/sparse: fix a bad comparison mm/cma.c: cma_declare_contiguous: correct err handling mm/page_ext.c: fix an imbalance with kmemleak mm, swap: bounds check swap_info array accesses to avoid NULL derefs mm,oom: don't kill global init via memory.oom.group memcg: killed threads should not invoke memcg OOM killer mm, mempolicy: fix uninit memory access mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512! mm/slab.c: kmemleak no scan alien caches ocfs2: fix a panic problem caused by o2cb_ctl f2fs: do not use mutex lock in atomic context fs/file.c: initialize init_files.resize_wait page_poison: play nicely with KASAN cifs: use correct format characters dm thin: add sanity checks to thin-pool and external snapshot creation f2fs: fix to check inline_xattr_size boundary correctly cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED cifs: Fix NULL pointer dereference of devname netfilter: nf_tables: check the result of dereferencing base_chain->stats netfilter: conntrack: tcp: only close if RST matches exact sequence jbd2: fix invalid descriptor block checksum fs: fix guard_bio_eod to check for real EOD errors tools lib traceevent: Fix buffer overflow in arg_eval PCI/PME: Fix hotplug/sysfs remove deadlock in pcie_pme_remove() wil6210: check null pointer in _wil_cfg80211_merge_extra_ies mt76: fix a leaked reference by adding a missing of_node_put crypto: crypto4xx - add missing of_node_put after of_device_is_available crypto: cavium/zip - fix collision with generic cra_driver_name usb: chipidea: Grab the (legacy) USB PHY by phandle first powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc coresight: etm4x: Add support to enable ETMv4.2 serial: 8250_pxa: honor the port number from devicetree ARM: 8840/1: use a raw_spinlock_t in unwind iommu/io-pgtable-arm-v7s: Only kmemleak_ignore L2 tables powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback btrfs: qgroup: Make qgroup async transaction commit more aggressive mmc: omap: fix the maximum timeout setting net: dsa: mv88e6xxx: Add lockdep classes to fix false positive splat e1000e: Fix -Wformat-truncation warnings mlxsw: spectrum: Avoid -Wformat-truncation warnings platform/x86: ideapad-laptop: Fix no_hw_rfkill_list for Lenovo RESCUER R720-15IKBN platform/mellanox: mlxreg-hotplug: Fix KASAN warning loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part() IB/mlx4: Increase the timeout for CM cache clk: fractional-divider: check parent rate only if flag is set perf annotate: Fix getting source line failure ASoC: qcom: Fix of-node refcount unbalance in qcom_snd_parse_of() cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies efi: cper: Fix possible out-of-bounds access s390/ism: ignore some errors during deregistration scsi: megaraid_sas: return error when create DMA pool failed scsi: fcoe: make use of fip_mode enum complete drm/amd/display: Clear stream->mode_changed after commit perf test: Fix failure of 'evsel-tp-sched' test on s390 mwifiex: don't advertise IBSS features without FW support perf report: Don't shadow inlined symbol with different addr range SoC: imx-sgtl5000: add missing put_device() media: ov7740: fix runtime pm initialization media: sh_veu: Correct return type for mem2mem buffer helpers media: s5p-jpeg: Correct return type for mem2mem buffer helpers media: rockchip/rga: Correct return type for mem2mem buffer helpers media: s5p-g2d: Correct return type for mem2mem buffer helpers media: mx2_emmaprp: Correct return type for mem2mem buffer helpers media: mtk-jpeg: Correct return type for mem2mem buffer helpers mt76: usb: do not run mt76u_queues_deinit twice xen/gntdev: Do not destroy context while dma-bufs are in use vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1 HID: intel-ish-hid: avoid binding wrong ishtp_cl_device cgroup, rstat: Don't flush subtree root unless necessary jbd2: fix race when writing superblock leds: lp55xx: fix null deref on firmware load failure perf report: Add s390 diagnosic sampling descriptor size iwlwifi: pcie: fix emergency path ACPI / video: Refactor and fix dmi_is_desktop() selftests: skip seccomp get_metadata test if not real root kprobes: Prohibit probing on bsearch() kprobes: Prohibit probing on RCU debug routine netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm ARM: 8833/1: Ensure that NEON code always compiles with Clang ARM: dts: meson8b: fix the Ethernet data line signals in eth_rgmii_pins ALSA: PCM: check if ops are defined before suspending PCM ath10k: fix shadow register implementation for WCN3990 usb: f_fs: Avoid crash due to out-of-scope stack ptr access sched/topology: Fix percpu data types in struct sd_data & struct s_data bcache: fix input overflow to cache set sysfs file io_error_halflife bcache: fix input overflow to sequential_cutoff bcache: fix potential div-zero error of writeback_rate_i_term_inverse bcache: improve sysfs_strtoul_clamp() genirq: Avoid summation loops for /proc/stat net: marvell: mvpp2: fix stuck in-band SGMII negotiation iw_cxgb4: fix srqidx leak during connection abort net: phy: consider latched link-down status in polling mode fbdev: fbmem: fix memory access if logo is bigger than the screen cdrom: Fix race condition in cdrom_sysctl_register drm: rcar-du: add missing of_node_put drm/amd/display: Don't re-program planes for DPMS changes drm/amd/display: Disconnect mpcc when changing tg perf/aux: Make perf_event accessible to setup_aux() e1000e: fix cyclic resets at link up with active tx e1000e: Exclude device from suspend direct complete optimization platform/x86: intel_pmc_core: Fix PCH IP sts reading i2c: of: Try to find an I2C adapter matching the parent staging: spi: mt7621: Add return code check on device_reset() iwlwifi: mvm: fix RFH config command with >=10 CPUs ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK efi/memattr: Don't bail on zero VA if it equals the region's PA sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() drm/vkms: Bugfix extra vblank frame ARM: dts: lpc32xx: Remove leading 0x and 0s from bindings notation efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted soc: qcom: gsbi: Fix error handling in gsbi_probe() mt7601u: bump supported EEPROM version ARM: 8830/1: NOMMU: Toggle only bits in EXC_RETURN we are really care of ARM: avoid Cortex-A9 livelock on tight dmb loops block, bfq: fix in-service-queue check for queue merging bpf: fix missing prototype warnings selftests/bpf: skip verifier tests for unsupported program types powerpc/64s: Clear on-stack exception marker upon exception return cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting backlight: pwm_bl: Use gpiod_get_value_cansleep() to get initial state tty: increase the default flip buffer limit to 2*640K powerpc/pseries: Perform full re-add of CPU for topology update post-migration drm/amd/display: Enable vblank interrupt during CRC capture ALSA: dice: add support for Solid State Logic Duende Classic/Mini usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded platform/x86: intel-hid: Missing power button release on some Dell models perf script python: Use PyBytes for attr in trace-event-python perf script python: Add trace_context extension module to sys.modules media: mt9m111: set initial frame size other than 0x0 hwrng: virtio - Avoid repeated init of completion soc/tegra: fuse: Fix illegal free of IO base address HID: intel-ish: ipc: handle PIMR before ish_wakeup also clear PISR busy_clear bit f2fs: UBSAN: set boolean value iostat_enable correctly hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable cpu/hotplug: Mute hotplug lockdep during init dmaengine: imx-dma: fix warning comparison of distinct pointer types dmaengine: qcom_hidma: assign channel cookie correctly dmaengine: qcom_hidma: initialize tx flags in hidma_prep_dma_* netfilter: physdev: relax br_netfilter dependency media: rcar-vin: Allow independent VIN link enablement media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting pinctrl: meson: meson8b: add the eth_rxd2 and eth_rxd3 pins drm: Auto-set allow_fb_modifiers when given modifiers at plane init drm/nouveau: Stop using drm_crtc_force_disable x86/build: Specify elf_i386 linker emulation explicitly for i386 objects selinux: do not override context on context mounts brcmfmac: Use firmware_request_nowarn for the clm_blob wlcore: Fix memory leak in case wl12xx_fetch_firmware failure x86/build: Mark per-CPU symbols as absolute explicitly for LLD drm/fb-helper: fix leaks in error path of drm_fb_helper_fbdev_setup clk: meson: clean-up clock registration clk: rockchip: fix frac settings of GPLL clock for rk3328 dmaengine: tegra: avoid overflow of byte tracking Input: soc_button_array - fix mapping of the 5th GPIO in a PNP0C40 device drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers net: stmmac: Avoid one more sometimes uninitialized Clang warning ACPI / video: Extend chassis-type detection with a "Lunch Box" check bcache: fix potential div-zero error of writeback_rate_p_term_inverse kprobes/x86: Blacklist non-attachable interrupt functions Linux 4.19.34 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
|
d0bc74c563 |
cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
[ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ]
The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
needs pids_free() to uncharge the pid.
However, ->free() is called from __put_task_struct()->cgroup_free() and this
is too late. Even the trivial program which does
for (;;) {
int pid = fork();
assert(pid >= 0);
if (pid)
wait(NULL);
else
exit(0);
}
can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
implies an RCU gp after the task/pid goes away and before the final put().
Test-case:
mkdir -p /tmp/CG
mount -t cgroup2 none /tmp/CG
echo '+pids' > /tmp/CG/cgroup.subtree_control
mkdir /tmp/CG/PID
echo 2 > /tmp/CG/PID/pids.max
perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
echo $! > /tmp/CG/PID/cgroup.procs
Without this patch the forking process fails soon after migration.
Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
into the new helper, cgroup_release(), called by release_task() which actually
frees the pid(s).
Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
dc9cd29ded |
UPSTREAM: psi: cgroup support
On a system that executes multiple cgrouped jobs and independent workloads, we don't just care about the health of the overall system, but also that of individual jobs, so that we can ensure individual job health, fairness between jobs, or prioritize some jobs over others. This patch implements pressure stall tracking for cgroups. In kernels with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure, and io.pressure files that track aggregate pressure stall times for only the tasks inside the cgroup. Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 2ce7135adc9ad081aa3c49744144376ac74fea60) Bug: 127712811 Test: lmkd in PSI mode Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec Signed-off-by: Suren Baghdasaryan <surenb@google.com> |
||
|
|
7723628101 |
bpf: Introduce bpf_skb_ancestor_cgroup_id helper
== Problem description == It's useful to be able to identify cgroup associated with skb in TC so that a policy can be applied to this skb, and existing bpf_skb_cgroup_id helper can help with this. Though in real life cgroup hierarchy and hierarchy to apply a policy to don't map 1:1. It's often the case that there is a container and corresponding cgroup, but there are many more sub-cgroups inside container, e.g. because it's delegated to containerized application to control resources for its subsystems, or to separate application inside container from infra that belongs to containerization system (e.g. sshd). At the same time it may be useful to apply a policy to container as a whole. If multiple containers like this are run on a host (what is often the case) and many of them have sub-cgroups, it may not be possible to apply per-container policy in TC with existing helpers such as bpf_skb_under_cgroup or bpf_skb_cgroup_id: * bpf_skb_cgroup_id will return id of immediate cgroup associated with skb, i.e. if it's a sub-cgroup inside container, it can't be used to identify container's cgroup; * bpf_skb_under_cgroup can work only with one cgroup and doesn't scale, i.e. if there are N containers on a host and a policy has to be applied to M of them (0 <= M <= N), it'd require M calls to bpf_skb_under_cgroup, and, if M changes, it'd require to rebuild & load new BPF program. == Solution == The patch introduces new helper bpf_skb_ancestor_cgroup_id that can be used to get id of cgroup v2 that is an ancestor of cgroup associated with skb at specified level of cgroup hierarchy. That way admin can place all containers on one level of cgroup hierarchy (what is a good practice in general and already used in many configurations) and identify specific cgroup on this level no matter what sub-cgroup skb is associated with. E.g. if there is a cgroup hierarchy: root/ root/container1/ root/container1/app11/ root/container1/app11/sub-app-a/ root/container1/app12/ root/container2/ root/container2/app21/ root/container2/app22/ root/container2/app22/sub-app-b/ , then having skb associated with root/container1/app11/sub-app-a/ it's possible to get ancestor at level 1, what is container1 and apply policy for this container, or apply another policy if it's container2. Policies can be kept e.g. in a hash map where key is a container cgroup id and value is an action. Levels where container cgroups are created are usually known in advance whether cgroup hierarchy inside container may be hard to predict especially in case when its creation is delegated to containerized application. == Implementation details == The helper gets ancestor by walking parents up to specified level. Another option would be to get different kind of "id" from cgroup->ancestor_ids[level] and use it with idr_find() to get struct cgroup for ancestor. But that would require radix lookup what doesn't seem to be better (at least it's not obviously better). Format of return value of the new helper is same as that of bpf_skb_cgroup_id. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> |
||
|
|
0fa294fb19 |
cgroup: Replace cgroup_rstat_mutex with a spinlock
Currently, rstat flush path is protected with a mutex which is fine as all the existing users are from interface file show path. However, rstat is being generalized for use by controllers and flushing from atomic contexts will be necessary. This patch replaces cgroup_rstat_mutex with a spinlock and adds a irq-safe flush function - cgroup_rstat_flush_irqsafe(). Explicit yield handling is added to the flush path so that other flush functions can yield to other threads and flushers. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
6162cef0f7 |
cgroup: Factor out and expose cgroup_rstat_*() interface functions
cgroup_rstat is being generalized so that controllers can use it too. This patch factors out and exposes the following interface functions. * cgroup_rstat_updated(): Renamed from cgroup_rstat_cpu_updated() for consistency. * cgroup_rstat_flush_hold/release(): Factored out from base stat implementation. * cgroup_rstat_flush(): Verbatim expose. While at it, drop assert on cgroup_rstat_mutex in cgroup_base_stat_flush() as it crosses layers and make a minor comment update. v2: Added EXPORT_SYMBOL_GPL(cgroup_rstat_updated) to fix a build bug. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
22714a2ba4 |
Merge branch 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"Cgroup2 cpu controller support is finally merged.
- Basic cpu statistics support to allow monitoring by default without
the CPU controller enabled.
- cgroup2 cpu controller support.
- /sys/kernel/cgroup files to help dealing with new / optional
features"
* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: export list of cgroups v2 features using sysfs
cgroup: export list of delegatable control files using sysfs
cgroup: mark @cgrp __maybe_unused in cpu_stat_show()
MAINTAINERS: relocate cpuset.c
cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
sched: Implement interface for cgroup unified hierarchy
sched: Misc preps for cgroup unified hierarchy interface
sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
cgroup: statically initialize init_css_set->dfl_cgrp
cgroup: Implement cgroup2 basic CPU usage accounting
cpuacct: Introduce cgroup_account_cputime[_field]()
sched/cputime: Expose cputime_adjust()
|
||
|
|
b24413180f |
License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
d41bf8c9de |
cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
The basic cpu stat is currently shown with "cpu." prefix in cgroup.stat, and the same information is duplicated in cpu.stat when cpu controller is enabled. This is ugly and not very scalable as we want to expand the coverage of stat information which is always available. This patch makes cgroup core always create "cpu.stat" file and show the basic cpu stat there and calls the cpu controller to show the extra stats when enabled. This ensures that the same information isn't presented in multiple places and makes future expansion of basic stats easier. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> |
||
|
|
041cd640b2 |
cgroup: Implement cgroup2 basic CPU usage accounting
In cgroup1, while cpuacct isn't actually controlling any resources, it
is a separate controller due to combination of two factors -
1. enabling cpu controller has significant side effects, and 2. we
have to pick one of the hierarchies to account CPU usages on. cpuacct
controller is effectively used to designate a hierarchy to track CPU
usages on.
cgroup2's unified hierarchy removes the second reason and we can
account basic CPU usages by default. While we can use cpuacct for
this purpose, both its interface and implementation leave a lot to be
desired - it collects and exposes two sources of truth which don't
agree with each other and some of the exposed statistics don't make
much sense. Also, it propagates all the way up the hierarchy on each
accounting event which is unnecessary.
This patch adds basic resource accounting mechanism to cgroup2's
unified hierarchy and accounts CPU usages using it.
* All accountings are done per-cpu and don't propagate immediately.
It just bumps the per-cgroup per-cpu counters and links to the
parent's updated list if not already on it.
* On a read, the per-cpu counters are collected into the global ones
and then propagated upwards. Only the per-cpu counters which have
changed since the last read are propagated.
* CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
prefix. Total usage is collected from scheduling events. User/sys
breakdown is sourced from tick sampling and adjusted to the usage
using cputime_adjust().
This keeps the accounting side hot path O(1) and per-cpu and the read
side O(nr_updated_since_last_read).
v2: Minor changes and documentation updates as suggested by Waiman and
Roman.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
|
||
|
|
d2cc5ed694 |
cpuacct: Introduce cgroup_account_cputime[_field]()
Introduce cgroup_account_cputime[_field]() which wrap cpuacct_charge() and cgroup_account_field(). This doesn't introduce any functional changes and will be used to add cgroup basic resource accounting. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> |
||
|
|
a0725ab0c7 |
Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This contains:
- Fix for a registration race in loop, from Anton Volkov.
- Overflow complaint fix from Arnd for DAC960.
- Series of drbd changes from the usual suspects.
- Conversion of the stec/skd driver to blk-mq. From Bart.
- A few BFQ improvements/fixes from Paolo.
- CFQ improvement from Ritesh, allowing idling for group idle.
- A few fixes found by Dan's smatch, courtesy of Dan.
- A warning fixup for a race between changing the IO scheduler and
device remova. From David Jeffery.
- A few nbd fixes from Josef.
- Support for cgroup info in blktrace, from Shaohua.
- Also from Shaohua, new features in the null_blk driver to allow it
to actually hold data, among other things.
- Various corner cases and error handling fixes from Weiping Zhang.
- Improvements to the IO stats tracking for blk-mq from me. Can
drastically improve performance for fast devices and/or big
machines.
- Series from Christoph removing bi_bdev as being needed for IO
submission, in preparation for nvme multipathing code.
- Series from Bart, including various cleanups and fixes for switch
fall through case complaints"
* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
kernfs: checking for IS_ERR() instead of NULL
drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
drbd: Fix allyesconfig build, fix recent commit
drbd: switch from kmalloc() to kmalloc_array()
drbd: abort drbd_start_resync if there is no connection
drbd: move global variables to drbd namespace and make some static
drbd: rename "usermode_helper" to "drbd_usermode_helper"
drbd: fix race between handshake and admin disconnect/down
drbd: fix potential deadlock when trying to detach during handshake
drbd: A single dot should be put into a sequence.
drbd: fix rmmod cleanup, remove _all_ debugfs entries
drbd: Use setup_timer() instead of init_timer() to simplify the code.
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
drbd: new disk-option disable-write-same
drbd: Fix resource role for newly created resources in events2
drbd: mark symbols static where possible
drbd: Send P_NEG_ACK upon write error in protocol != C
drbd: add explicit plugging when submitting batches
drbd: change list_for_each_safe to while(list_first_entry_or_null)
drbd: introduce drbd_recv_header_maybe_unplug
...
|
||
|
|
3e48930cc7 |
cgroup: misc changes
Misc trivial changes to prepare for future changes. No functional difference. * Expose cgroup_get(), cgroup_tryget() and cgroup_parent(). * Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp. * Rename cgroup_stats_show() to cgroup_stat_show() for consistency with the file name. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
69fd5c3917 |
blktrace: add an option to allow displaying cgroup path
By default we output cgroup id in blktrace. This adds an option to display cgroup path. Since get cgroup path is a relativly heavy operation, we don't enable it by default. with the option enabled, blktrace will output something like this: dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd] Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
121508df44 |
cgroup: export fhandle info for a cgroup
Add an API to export cgroup fhandle info. We don't export a full 'struct file_handle', there are unrequired info. Sepcifically, cgroup is always a directory, so we don't need a 'FILEID_INO32_GEN_PARENT' type fhandle, we only need export the inode number and generation number just like what generic_fh_to_dentry does. And we can avoid the overhead of getting an inode too, since kernfs_node_id (ino and generation) has all the info required. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
c53cd490b1 |
kernfs: introduce kernfs_node_id
inode number and generation can identify a kernfs node. We are going to export the identification by exportfs operations, so put ino and generation into a separate structure. It's convenient when later patches use the identification. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
450ee0c1fe |
cgroup: implement CSS_TASK_ITER_THREADED
cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the dom_cgrp to which the processes of the subtree conceptually belong
and domain-level resource consumptions not tied to any specific task
are charged. In the subtree, threads won't be subject to process
granularity or no-internal-task constraint and can be distributed
arbitrarily across the subtree.
This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
which, when used on a dom_cgrp, makes the iteration include the tasks
on all the associated threaded css_sets. "cgroup.procs" read path is
updated to use it so that reading the file on a proc_cgrp lists all
processes. This will also be used by controller implementations which
need to walk processes or tasks at the resource domain level.
Task iteration is implemented nested in css_set iteration. If
CSS_TASK_ITER_THREADED is specified, after walking tasks of each
!threaded css_set, all the associated threaded css_sets are visited
before moving onto the next !threaded css_set.
v2: ->cur_pcset renamed to ->cur_dcset. Updated for the new
enable-threaded-per-cgroup behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
||
|
|
454000adaa |
cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
cgroup v2 is in the process of growing thread granularity support. A
threaded subtree is composed of a thread root and threaded cgroups
which are proper members of the subtree.
The root cgroup of the subtree serves as the domain cgroup to which
the processes (as opposed to threads / tasks) of the subtree
conceptually belong and domain-level resource consumptions not tied to
any specific task are charged. Inside the subtree, threads won't be
subject to process granularity or no-internal-task constraint and can
be distributed arbitrarily across the subtree.
This patch introduces cgroup->dom_cgrp along with threaded css_set
handling.
* cgroup->dom_cgrp points to self for normal and thread roots. For
proper thread subtree members, points to the dom_cgrp (the thread
root).
* css_set->dom_cset points to self if for normal and thread roots. If
threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
The dom_cgrp serves as the resource domain and keeps the matching
csses available. The dom_cset holds those csses and makes them
easily accessible.
* All threaded csets are linked on their dom_csets to enable iteration
of all threaded tasks.
* cgroup->nr_threaded_children keeps track of the number of threaded
children.
This patch adds the above but doesn't actually use them yet. The
following patches will build on top.
v4: ->nr_threaded_children added.
v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset. Updated for the new
enable-threaded-per-cgroup behavior.
v2: Added cgroup_is_threaded() helper.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
||
|
|
bc2fb7ed08 |
cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
css_task_iter currently always walks all tasks. With the scheduled cgroup v2 thread support, the iterator would need to handle multiple types of iteration. As a preparation, add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag is not specified, it walks all tasks as before. When asserted, the iterator only walks the group leaders. For now, the only user of the flag is cgroup v2 "cgroup.procs" file which no longer needs to skip non-leader tasks in cgroup_procs_next(). Note that cgroup v1 "cgroup.procs" can't use the group leader walk as v1 "cgroup.procs" doesn't mean "list all thread group leaders in the cgroup" but "list all thread group id's with any threads in the cgroup". While at it, update cgroup_procs_show() to use task_pid_vnr() instead of task_tgid_vnr(). As the iteration guarantees that the function only sees group leaders, this doesn't change the output and will allow sharing the function for thread iteration. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
788b950c62 |
cgroup: distinguish local and children populated states
cgrp->populated_cnt counts both local (the cgroup's populated css_sets) and subtree proper (populated children) so that it's only zero when the whole subtree, including self, is empty. This patch splits the counter into two so that local and children populated states are tracked separately. It allows finer-grained tests on the state of the hierarchy which will be used to replace css_set walking local populated test. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
41c25707d2 |
cpuset: consider dying css as offline
In most cases, a cgroup controller don't care about the liftimes of cgroups. For the controller, a css becomes online when ->css_online() is called on it and offline when ->css_offline() is called. However, cpuset is special in that the user interface it exposes cares whether certain cgroups exist or not. Combined with the RCU delay between cgroup removal and css offlining, this can lead to user visible behavior oddities where operations which should succeed after cgroup removals fail for some time period. The effects of cgroup removals are delayed when seen from userland. This patch adds css_is_dying() which tests whether offline is pending and updates is_cpuset_online() so that the function returns false also while offline is pending. This gets rid of the userland visible delays. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com> Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com Cc: stable@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
9410091dd5 |
Merge branch 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: "Nothing major. Two notable fixes are Li's second stab at fixing the long-standing race condition in the mount path and suppression of spurious warning from cgroup_get(). All other changes are trivial" * 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: mark cgroup_get() with __maybe_unused cgroup: avoid attaching a cgroup root to two different superblocks, take 2 cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc() cgroup: move cgroup_subsys_state parent field for cache locality cpuset: Remove cpuset_update_active_cpus()'s parameter. cgroup: switch to BUG_ON() cgroup: drop duplicate header nsproxy.h kernel: convert css_set.refcount from atomic_t to refcount_t kernel: convert cgroup_namespace.count from atomic_t to refcount_t |
||
|
|
8f48cfabac |
cgroup: drop duplicate header nsproxy.h
Drop duplicate header nsproxy.h from linux/cgroup.h. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
77f88796ce |
cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.
In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.
Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.
This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland. kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed. The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.
It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
v2: Switch to a simpler implementation using a new task_struct bit
field suggested by Oleg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
|
||
|
|
387ad9674b |
kernel: convert cgroup_namespace.count from atomic_t to refcount_t
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
7b4632f048 |
cgroup: fix a comment typo
Fix a comment typo in cgroup.h. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
f34d3606f7 |
Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: - tracepoints for basic cgroup management operations added - kernfs and cgroup path formatting functions updated to behave in the style of strlcpy() - non-critical bug fixes * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent() cpuset: fix error handling regression in proc_cpuset_show() cgroup: add tracepoints for basic operations cgroup: make cgroup_path() and friends behave in the style of strlcpy() kernfs: remove kernfs_path_len() kernfs: make kernfs_path*() behave in the style of strlcpy() kernfs: add dummy implementation of kernfs_path_from_node() |
||
|
|
14986a34e1 |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
"This set of changes is a number of smaller things that have been
overlooked in other development cycles focused on more fundamental
change. The devpts changes are small things that were a distraction
until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
trivial regression fix to autofs for the unprivileged mount changes
that went in last cycle. A pair of ioctls has been added by Andrey
Vagin making it is possible to discover the relationships between
namespaces when referring to them through file descriptors.
The big user visible change is starting to add simple resource limits
to catch programs that misbehave. With namespaces in general and user
namespaces in particular allowing users to use more kinds of
resources, it has become important to have something to limit errant
programs. Because the purpose of these limits is to catch errant
programs the code needs to be inexpensive to use as it always on, and
the default limits need to be high enough that well behaved programs
on well behaved systems don't encounter them.
To this end, after some review I have implemented per user per user
namespace limits, and use them to limit the number of namespaces. The
limits being per user mean that one user can not exhause the limits of
another user. The limits being per user namespace allow contexts where
the limit is 0 and security conscious folks can remove from their
threat anlysis the code used to manage namespaces (as they have
historically done as it root only). At the same time the limits being
per user namespace allow other parts of the system to use namespaces.
Namespaces are increasingly being used in application sand boxing
scenarios so an all or nothing disable for the entire system for the
security conscious folks makes increasing use of these sandboxes
impossible.
There is also added a limit on the maximum number of mounts present in
a single mount namespace. It is nontrivial to guess what a reasonable
system wide limit on the number of mount structure in the kernel would
be, especially as it various based on how a system is using
containers. A limit on the number of mounts in a mount namespace
however is much easier to understand and set. In most cases in
practice only about 1000 mounts are used. Given that some autofs
scenarious have the potential to be 30,000 to 50,000 mounts I have set
the default limit for the number of mounts at 100,000 which is well
above every known set of users but low enough that the mount hash
tables don't degrade unreaonsably.
These limits are a start. I expect this estabilishes a pattern that
other limits for resources that namespaces use will follow. There has
been interest in making inotify event limits per user per user
namespace as well as interest expressed in making details about what
is going on in the kernel more visible"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
autofs: Fix automounts by using current_real_cred()->uid
mnt: Add a per mount namespace limit on the number of mounts
netns: move {inc,dec}_net_namespaces into #ifdef
nsfs: Simplify __ns_get_path
tools/testing: add a test to check nsfs ioctl-s
nsfs: add ioctl to get a parent namespace
nsfs: add ioctl to get an owning user namespace for ns file descriptor
kernel: add a helper to get an owning user namespace for a namespace
devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
devpts: Remove sync_filesystems
devpts: Make devpts_kill_sb safe if fsi is NULL
devpts: Simplify devpts_mount by using mount_nodev
devpts: Move the creation of /dev/pts/ptmx into fill_super
devpts: Move parse_mount_options into fill_super
userns: When the per user per user namespace limit is reached return ENOSPC
userns; Document per user per user namespace limits.
mntns: Add a limit on the number of mount namespaces.
netns: Add a limit on the number of net namespaces
cgroupns: Add a limit on the number of cgroup namespaces
ipcns: Add a limit on the number of ipc namespaces
...
|
||
|
|
aed704b7a6 |
cgroup: Add task_under_cgroup_hierarchy cgroup inline function to headers
This commit adds an inline function to cgroup.h to check whether a given task is under a given cgroup hierarchy. This is to avoid having to put ifdefs in .c files to gate access to cgroups. When cgroups are disabled this always returns true. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
4c737b41de |
cgroup: make cgroup_path() and friends behave in the style of strlcpy()
cgroup_path() and friends used to format the path from the end and thus the resulting path usually didn't start at the start of the passed in buffer. Also, when the buffer was too small, the partial result was truncated from the head rather than tail and there was no way to tell how long the full path would be. These make the functions less robust and more awkward to use. With recent updates to kernfs_path(), cgroup_path() and friends can be made to behave in strlcpy() style. * cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now always return the length of the full path. If buffer is too small, it contains nul terminated truncated output. * All users updated accordingly. v2: cgroup_path() usage in kernel/sched/debug.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Peter Zijlstra <peterz@infradead.org> |
||
|
|
3abb1d90f5 |
kernfs: make kernfs_path*() behave in the style of strlcpy()
kernfs_path*() functions always return the length of the full path but the path content is undefined if the length is larger than the provided buffer. This makes its behavior different from strlcpy() and requires error handling in all its users even when they don't care about truncation. In addition, the implementation can actully be simplified by making it behave properly in strlcpy() style. * Update kernfs_path_from_node_locked() to always fill up the buffer with path. If the buffer is not large enough, the output is truncated and terminated. * kernfs_path() no longer needs error handling. Make it a simple inline wrapper around kernfs_path_from_node(). * sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling. Updated accordingly. * cgroup_path()'s use of kernfs_path() updated to retain the old behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com> |
||
|
|
d08311dd6f |
cgroupns: Add a limit on the number of cgroup namespaces
Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
|
|
1f3fe7ebf6 |
cgroup: Add cgroup_get_from_fd
Add a helper function to get a cgroup2 from a fd. It will be stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will be introduced in the later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
a79a908fd2 |
cgroup: introduce cgroup namespaces
Introduce the ability to create new cgroup namespace. The newly created cgroup namespace remembers the cgroup of the process at the point of creation of the cgroup namespace (referred as cgroupns-root). The main purpose of cgroup namespace is to virtualize the contents of /proc/self/cgroup file. Processes inside a cgroup namespace are only able to see paths relative to their namespace root (unless they are moved outside of their cgroupns-root, at which point they will see a relative path from their cgroupns-root). For a correctly setup container this enables container-tools (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized containers without leaking system level cgroup hierarchy to the task. This patch only implements the 'unshare' part of the cgroupns. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
34a9304a96 |
Merge branch 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: - cgroup v2 interface is now official. It's no longer hidden behind a devel flag and can be mounted using the new cgroup2 fs type. Unfortunately, cpu v2 interface hasn't made it yet due to the discussion around in-process hierarchical resource distribution and only memory and io controllers can be used on the v2 interface at the moment. - The existing documentation which has always been a bit of mess is relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt is added as the authoritative documentation for the v2 interface. - Some features are added through for-4.5-ancestor-test branch to enable netfilter xt_cgroup match to use cgroup v2 paths. The actual netfilter changes will be merged through the net tree which pulled in the said branch. - Various cleanups * 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rename cgroup documentations cgroup: fix a typo. cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX. cgroup: demote subsystem init messages to KERN_DEBUG cgroup: Fix uninitialized variable warning cgroup: put controller Kconfig options in meaningful order cgroup: clean up the kernel configuration menu nomenclature cgroup_pids: fix a typo. Subject: cgroup: Fix incomplete dd command in blkio documentation cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends cpuset: Replace all instances of time_t with time64_t cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/ cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type |