89 Commits

Author SHA1 Message Date
Tejun Heo
c2a170c293 FROMGIT: cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.

Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:

 #1> mkdir -p /sys/fs/cgroup/misc/a/b
 #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
 #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
 #4> PID=$!
 #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
 #6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs

the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.

After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:

 1. cgroup_migrate_add_src() is called on B for the leader.

 2. cgroup_migrate_add_src() is called on A for the other threads.

 3. cgroup_migrate_prepare_dst() is called. It scans the src list.

 4. It notices that B wants to migrate to A, so it tries to A to the dst
    list but realizes that its ->mg_preload_node is already busy.

 5. and then it notices A wants to migrate to A as it's an identity
    migration, it culls it by list_del_init()'ing its ->mg_preload_node and
    putting references accordingly.

 6. The rest of migration takes place with B on the src list but nothing on
    the dst list.

This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.

This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.

This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de9851 ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
(cherry picked from commit 07fd5b6cdf3cc30bfde8fe0f644771688be04447
 https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-5.19-fixes)
Bug: 235577024
Change-Id: Ieaf1c0c8fc23753570897fd6e48a54335ab939ce
Signed-off-by: Steve Muckle <smuckle@google.com>
Git-commit: d1faa010ca160216a17435b81c642daf08ecbcbd
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Srinivasarao Pathipati <quic_c_spathi@quicinc.com>
2022-06-30 08:19:43 -07:00
Suren Baghdasaryan
f7c2472acb BACKPORT: cgroup: make per-cgroup pressure stall tracking configurable
PSI accounts stalls for each cgroup separately and aggregates it at each
level of the hierarchy. This causes additional overhead with psi_avgs_work
being called for each cgroup in the hierarchy. psi_avgs_work has been
highly optimized, however on systems with large number of cgroups the
overhead becomes noticeable.
Systems which use PSI only at the system level could avoid this overhead
if PSI can be configured to skip per-cgroup stall accounting.
Add "cgroup_disable=pressure" kernel command-line option to allow
requesting system-wide only pressure stall accounting. When set, it
keeps system-wide accounting under /proc/pressure/ but skips accounting
for individual cgroups and does not expose PSI nodes in cgroup hierarchy.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Link:  https://lore.kernel.org/patchwork/patch/1435705
(cherry picked from commit 3958e2d0c34e18c41b60dc01832bd670a59ef70f
 https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git tj)

Conflicts:
        include/linux/cgroup-defs.h
        kernel/cgroup/cgroup.c

1. Trivial merge conflict in cgroup-defs.h due to missing CFTYPE_DEBUG
2. Changed flags to (CFTYPE_NOT_ON_ROOT | CFTYPE_PRESSURE) in cgroup.c
because in 4.19 psi files were allowed only in non-root cgroups.

Bug: 178872719
Bug: 191734423
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ifc8fbc52f9a1131d7c2668edbb44c525c76c3360
Git-commit: 92c6dd6a65
Git-repo: https://android.googlesource.com/kernel/common/
Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
2021-07-01 23:43:00 -07:00
Roman Gushchin
5318f3163c BACKPORT: cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.

This patch implements freezer for cgroup v2.

Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.

This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.

Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.

* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.

There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
Change-Id: I3404119678cbcd7410aa56e9334055cee79d02fa
(cherry picked from commit 76f969e8948d82e78e1bc4beb6b9465908e74873)
cgroup-defs.h: use the struct cgroup_freezer_state and the
freezer field from definitions in I6221a975c04f06249a4f8d693852776ae08a8d8e
sched.h: use the frozen field defined in
I6221a975c04f06249a4f8d693852776ae08a8d8e
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
2020-08-26 15:35:17 -07:00
Marco Ballesio
489d25a567 ANDROID: cgroups: add v2 freezer ABI changes
introduce the freezer_state struct to be used by the v2 freezer
backports

Test: built and booted
Bug: 163547360
Signed-off-by: Marco Ballesio <balejs@google.com>
Change-Id: I4705ee9787f35db58e27de339d5264d1cd45007a
2020-08-12 19:35:27 -07:00
Marco Ballesio
0e5ea532a0 ANDROID: cgroups: ABI padding
ABI padding in struct cgroup

Bug: 163547360
Test: built and booted the kernel
Change-Id: Ie6ef8bdc4a62f57039d3b456cf125db4582b255a
Signed-off-by: Marco Ballesio <balejs@google.com>
2020-08-12 19:35:27 -07:00
Greg Kroah-Hartman
83f60b3043 ANDROID: GKI: preserve ABI for struct sock_cgroup_data
In commit ad0f75e5f57c ("cgroup: fix cgroup_sk_alloc() for
sk_clone_lock()") the struct sock_cgroup_data fields are changed a bit,
in a way that keeps the same size and functionality, it just packs
another bit into the structure.

Because this does not really change the abi, tell the genksyms detector
that nothing has changed so that the ABI checker is happy.

Bug: 161946584
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ibc748616140ac0da69b04699cfd2322dc4e5d1f4
Signed-off-by: Will McVicker <willmcvicker@google.com>
2020-07-23 10:12:34 -07:00
Greg Kroah-Hartman
b41585fc93 Merge 4.19.134 into android-4.19-stable
Changes in 4.19.134
	perf: Make perf able to build with latest libbfd
	net: rmnet: fix lower interface leak
	genetlink: remove genl_bind
	ipv4: fill fl4_icmp_{type,code} in ping_v4_sendmsg
	l2tp: remove skb_dst_set() from l2tp_xmit_skb()
	llc: make sure applications use ARPHRD_ETHER
	net: Added pointer check for dst->ops->neigh_lookup in dst_neigh_lookup_skb
	net_sched: fix a memory leak in atm_tc_init()
	net: usb: qmi_wwan: add support for Quectel EG95 LTE modem
	tcp: fix SO_RCVLOWAT possible hangs under high mem pressure
	tcp: make sure listeners don't initialize congestion-control state
	tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()
	tcp: md5: do not send silly options in SYNCOOKIES
	tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers
	tcp: md5: allow changing MD5 keys in all socket states
	cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
	cgroup: Fix sock_cgroup_data on big-endian.
	sched: consistently handle layer3 header accesses in the presence of VLANs
	vlan: consolidate VLAN parsing code and limit max parsing depth
	drm/msm: fix potential memleak in error branch
	drm/exynos: fix ref count leak in mic_pre_enable
	m68k: nommu: register start of the memory with memblock
	m68k: mm: fix node memblock init
	arm64/alternatives: use subsections for replacement sequences
	tpm_tis: extra chip->ops check on error path in tpm_tis_core_init
	gfs2: read-only mounts should grab the sd_freeze_gl glock
	i2c: eg20t: Load module automatically if ID matches
	arm64/alternatives: don't patch up internal branches
	iio:magnetometer:ak8974: Fix alignment and data leak issues
	iio:humidity:hdc100x Fix alignment and data leak issues
	iio: magnetometer: ak8974: Fix runtime PM imbalance on error
	iio: mma8452: Add missed iio_device_unregister() call in mma8452_probe()
	iio: pressure: zpa2326: handle pm_runtime_get_sync failure
	iio:humidity:hts221 Fix alignment and data leak issues
	iio:pressure:ms5611 Fix buffer element alignment
	iio:health:afe4403 Fix timestamp alignment and prevent data leak.
	spi: fix initial SPI_SR value in spi-fsl-dspi
	spi: spi-fsl-dspi: Fix lockup if device is shutdown during SPI transfer
	net: dsa: bcm_sf2: Fix node reference count
	of: of_mdio: Correct loop scanning logic
	Revert "usb/ohci-platform: Fix a warning when hibernating"
	Revert "usb/xhci-plat: Set PM runtime as active on resume"
	Revert "usb/ehci-platform: Set PM runtime as active on resume"
	net: sfp: add support for module quirks
	net: sfp: add some quirks for GPON modules
	HID: quirks: Remove ITE 8595 entry from hid_have_special_driver
	ARM: at91: pm: add quirk for sam9x60's ulp1
	scsi: sr: remove references to BLK_DEV_SR_VENDOR, leave it enabled
	ALSA: usb-audio: Create a registration quirk for Kingston HyperX Amp (0951:16d8)
	doc: dt: bindings: usb: dwc3: Update entries for disabling SS instances in park mode
	mmc: sdhci: do not enable card detect interrupt for gpio cd type
	ALSA: usb-audio: Rewrite registration quirk handling
	ACPI: video: Use native backlight on Acer Aspire 5783z
	ALSA: usb-audio: Add registration quirk for Kingston HyperX Cloud Alpha S
	Input: mms114 - add extra compatible for mms345l
	ACPI: video: Use native backlight on Acer TravelMate 5735Z
	ALSA: usb-audio: Add registration quirk for Kingston HyperX Cloud Flight S
	iio:health:afe4404 Fix timestamp alignment and prevent data leak.
	phy: sun4i-usb: fix dereference of pointer phy0 before it is null checked
	arm64: dts: meson: add missing gxl rng clock
	spi: spi-sun6i: sun6i_spi_transfer_one(): fix setting of clock rate
	usb: gadget: udc: atmel: fix uninitialized read in debug printk
	staging: comedi: verify array index is correct before using it
	Revert "thermal: mediatek: fix register index error"
	ARM: dts: socfpga: Align L2 cache-controller nodename with dtschema
	regmap: debugfs: Don't sleep while atomic for fast_io regmaps
	copy_xstate_to_kernel: Fix typo which caused GDB regression
	apparmor: ensure that dfa state tables have entries
	perf stat: Zero all the 'ena' and 'run' array slot stats for interval mode
	soc: qcom: rpmh: Update dirty flag only when data changes
	soc: qcom: rpmh: Invalidate SLEEP and WAKE TCSes before flushing new data
	soc: qcom: rpmh-rsc: Clear active mode configuration for wake TCS
	soc: qcom: rpmh-rsc: Allow using free WAKE TCS for active request
	mtd: rawnand: marvell: Use nand_cleanup() when the device is not yet registered
	mtd: rawnand: marvell: Fix probe error path
	mtd: rawnand: timings: Fix default tR_max and tCCS_min timings
	mtd: rawnand: brcmnand: fix CS0 layout
	mtd: rawnand: oxnas: Keep track of registered devices
	mtd: rawnand: oxnas: Unregister all devices on error
	mtd: rawnand: oxnas: Release all devices in the _remove() path
	slimbus: core: Fix mismatch in of_node_get/put
	HID: magicmouse: do not set up autorepeat
	HID: quirks: Always poll Obins Anne Pro 2 keyboard
	HID: quirks: Ignore Simply Automated UPB PIM
	ALSA: line6: Perform sanity check for each URB creation
	ALSA: line6: Sync the pending work cancel at disconnection
	ALSA: usb-audio: Fix race against the error recovery URB submission
	ALSA: hda/realtek - change to suitable link model for ASUS platform
	ALSA: hda/realtek - Enable Speaker for ASUS UX533 and UX534
	USB: c67x00: fix use after free in c67x00_giveback_urb
	usb: dwc2: Fix shutdown callback in platform
	usb: chipidea: core: add wakeup support for extcon
	usb: gadget: function: fix missing spinlock in f_uac1_legacy
	USB: serial: iuu_phoenix: fix memory corruption
	USB: serial: cypress_m8: enable Simply Automated UPB PIM
	USB: serial: ch341: add new Product ID for CH340
	USB: serial: option: add GosunCn GM500 series
	USB: serial: option: add Quectel EG95 LTE modem
	virt: vbox: Fix VBGL_IOCTL_VMMDEV_REQUEST_BIG and _LOG req numbers to match upstream
	virt: vbox: Fix guest capabilities mask check
	virtio: virtio_console: add missing MODULE_DEVICE_TABLE() for rproc serial
	serial: mxs-auart: add missed iounmap() in probe failure and remove
	ovl: inode reference leak in ovl_is_inuse true case.
	ovl: relax WARN_ON() when decoding lower directory file handle
	ovl: fix unneeded call to ovl_change_flags()
	fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS
	Revert "zram: convert remaining CLASS_ATTR() to CLASS_ATTR_RO()"
	mei: bus: don't clean driver pointer
	Input: i8042 - add Lenovo XiaoXin Air 12 to i8042 nomux list
	uio_pdrv_genirq: fix use without device tree and no interrupt
	timer: Prevent base->clk from moving backward
	timer: Fix wheel index calculation on last level
	MIPS: Fix build for LTS kernel caused by backporting lpj adjustment
	riscv: use 16KB kernel stack on 64-bit
	hwmon: (emc2103) fix unable to change fan pwm1_enable attribute
	powerpc/book3s64/pkeys: Fix pkey_access_permitted() for execute disable pkey
	intel_th: pci: Add Jasper Lake CPU support
	intel_th: pci: Add Tiger Lake PCH-H support
	intel_th: pci: Add Emmitsburg PCH support
	intel_th: Fix a NULL dereference when hub driver is not loaded
	dmaengine: fsl-edma: Fix NULL pointer exception in fsl_edma_tx_handler
	misc: atmel-ssc: lock with mutex instead of spinlock
	thermal/drivers/cpufreq_cooling: Fix wrong frequency converted from power
	arm64: ptrace: Override SPSR.SS when single-stepping is enabled
	arm64: ptrace: Consistently use pseudo-singlestep exceptions
	arm64: compat: Ensure upper 32 bits of x0 are zero on syscall return
	sched: Fix unreliable rseq cpu_id for new tasks
	sched/fair: handle case of task_h_load() returning 0
	genirq/affinity: Handle affinity setting on inactive interrupts correctly
	printk: queue wake_up_klogd irq_work only if per-CPU areas are ready
	libceph: don't omit recovery_deletes in target_copy()
	rxrpc: Fix trace string
	spi: sprd: switch the sequence of setting WDG_LOAD_LOW and _HIGH
	Linux 4.19.134

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ieeb9e03f4a2d51aeebe3a3eadd9c1b93a26088a0
2020-07-22 13:03:12 +02:00
Cong Wang
6d584c6e29 cgroup: Fix sock_cgroup_data on big-endian.
[ Upstream commit 14b032b8f8fce03a546dcf365454bec8c4a58d7d ]

In order for no_refcnt and is_data to be the lowest order two
bits in the 'val' we have to pad out the bitfield of the u8.

Fixes: ad0f75e5f57c ("cgroup: fix cgroup_sk_alloc() for sk_clone_lock()")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-22 09:32:00 +02:00
Cong Wang
0505cc4c90 cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
[ Upstream commit ad0f75e5f57ccbceec13274e1e242f2b5a6397ed ]

When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.

The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.

This bug seems to be introduced since the beginning, commit
d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not compeletely. It seems not easy to trigger until
the recent commit 090e28b229af
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reported-by: Daniël Sonck <dsonck92@gmail.com>
Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-22 09:32:00 +02:00
Greg Kroah-Hartman
487e61785a Merge 4.19.66 into android-4.19
Changes in 4.19.66
	scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure
	gcc-9: don't warn about uninitialized variable
	driver core: Establish order of operations for device_add and device_del via bitflag
	drivers/base: Introduce kill_device()
	libnvdimm/bus: Prevent duplicate device_unregister() calls
	libnvdimm/region: Register badblocks before namespaces
	libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant
	libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
	HID: wacom: fix bit shift for Cintiq Companion 2
	HID: Add quirk for HP X1200 PIXART OEM mouse
	IB: directly cast the sockaddr union to aockaddr
	atm: iphase: Fix Spectre v1 vulnerability
	bnx2x: Disable multi-cos feature.
	ife: error out when nla attributes are empty
	ip6_gre: reload ipv6h in prepare_ip6gre_xmit_ipv6
	ip6_tunnel: fix possible use-after-free on xmit
	ipip: validate header length in ipip_tunnel_xmit
	mlxsw: spectrum: Fix error path in mlxsw_sp_module_init()
	mvpp2: fix panic on module removal
	mvpp2: refactor MTU change code
	net: bridge: delete local fdb on device init failure
	net: bridge: mcast: don't delete permanent entries when fast leave is enabled
	net: fix ifindex collision during namespace removal
	net/mlx5e: always initialize frag->last_in_page
	net/mlx5: Use reversed order when unregister devices
	net: phylink: Fix flow control for fixed-link
	net: qualcomm: rmnet: Fix incorrect UL checksum offload logic
	net: sched: Fix a possible null-pointer dereference in dequeue_func()
	net sched: update vlan action for batched events operations
	net: sched: use temporary variable for actions indexes
	net/smc: do not schedule tx_work in SMC_CLOSED state
	NFC: nfcmrvl: fix gpio-handling regression
	ocelot: Cancel delayed work before wq destruction
	tipc: compat: allow tipc commands without arguments
	tun: mark small packets as owned by the tap sock
	net/mlx5: Fix modify_cq_in alignment
	net/mlx5e: Prevent encap flow counter update async to user query
	r8169: don't use MSI before RTL8168d
	compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
	cgroup: Call cgroup_release() before __exit_signal()
	cgroup: Implement css_task_iter_skip()
	cgroup: Include dying leaders with live threads in PROCS iterations
	cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
	cgroup: Fix css_task_iter_advance_css_set() cset skip condition
	spi: bcm2835: Fix 3-wire mode if DMA is enabled
	Linux 4.19.66

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id33ce169af8bf14a3791040b4cf923832ce84f6c
2019-08-11 15:21:36 +02:00
Tejun Heo
4340d175b8 cgroup: Include dying leaders with live threads in PROCS iterations
commit c03cd7738a83b13739f00546166969342c8ff014 upstream.

CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
this means that a process with dying leader and live threads will be
skipped.  IOW, cgroup.procs might be empty while cgroup.threads isn't,
which is confusing to say the least.

Fix it by making cset track dying tasks and include dying leaders with
live threads in PROCS iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-09 17:52:34 +02:00
Greg Kroah-Hartman
cab4399ebf Merge 4.19.47 into android-4.19
Changes in 4.19.47
	x86: Hide the int3_emulate_call/jmp functions from UML
	ext4: do not delete unlinked inode from orphan list on failed truncate
	ext4: wait for outstanding dio during truncate in nojournal mode
	f2fs: Fix use of number of devices
	KVM: x86: fix return value for reserved EFER
	bio: fix improper use of smp_mb__before_atomic()
	sbitmap: fix improper use of smp_mb__before_atomic()
	Revert "scsi: sd: Keep disk read-only when re-reading partition"
	crypto: vmx - CTR: always increment IV as quadword
	mmc: sdhci-iproc: cygnus: Set NO_HISPD bit to fix HS50 data hold time problem
	mmc: sdhci-iproc: Set NO_HISPD bit to fix HS50 data hold time problem
	kvm: svm/avic: fix off-by-one in checking host APIC ID
	libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
	arm64/kernel: kaslr: reduce module randomization range to 2 GB
	arm64/iommu: handle non-remapped addresses in ->mmap and ->get_sgtable
	gfs2: Fix sign extension bug in gfs2_update_stats
	btrfs: don't double unlock on error in btrfs_punch_hole
	Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path
	Btrfs: avoid fallback to transaction commit during fsync of files with holes
	Btrfs: fix race between ranged fsync and writeback of adjacent ranges
	btrfs: sysfs: Fix error path kobject memory leak
	btrfs: sysfs: don't leak memory when failing add fsid
	udlfb: fix some inconsistent NULL checking
	fbdev: fix divide error in fb_var_to_videomode
	NFSv4.2 fix unnecessary retry in nfs4_copy_file_range
	NFSv4.1 fix incorrect return value in copy_file_range
	bpf: add bpf_jit_limit knob to restrict unpriv allocations
	brcmfmac: assure SSID length from firmware is limited
	brcmfmac: add subtype check for event handling in data path
	arm64: errata: Add workaround for Cortex-A76 erratum #1463225
	btrfs: honor path->skip_locking in backref code
	ovl: relax WARN_ON() for overlapping layers use case
	fbdev: fix WARNING in __alloc_pages_nodemask bug
	media: cpia2: Fix use-after-free in cpia2_exit
	media: serial_ir: Fix use-after-free in serial_ir_init_module
	media: vb2: add waiting_in_dqbuf flag
	media: vivid: use vfree() instead of kfree() for dev->bitmap_cap
	ssb: Fix possible NULL pointer dereference in ssb_host_pcmcia_exit
	bpf: devmap: fix use-after-free Read in __dev_map_entry_free
	batman-adv: mcast: fix multicast tt/tvlv worker locking
	at76c50x-usb: Don't register led_trigger if usb_register_driver failed
	acct_on(): don't mess with freeze protection
	Revert "btrfs: Honour FITRIM range constraints during free space trim"
	gfs2: Fix lru_count going negative
	cxgb4: Fix error path in cxgb4_init_module
	NFS: make nfs_match_client killable
	IB/hfi1: Fix WQ_MEM_RECLAIM warning
	gfs2: Fix occasional glock use-after-free
	mmc: core: Verify SD bus width
	tools/bpf: fix perf build error with uClibc (seen on ARC)
	selftests/bpf: set RLIMIT_MEMLOCK properly for test_libbpf_open.c
	bpftool: exclude bash-completion/bpftool from .gitignore pattern
	dmaengine: tegra210-dma: free dma controller in remove()
	net: ena: gcc 8: fix compilation warning
	hv_netvsc: fix race that may miss tx queue wakeup
	Bluetooth: Ignore CC events not matching the last HCI command
	pinctrl: zte: fix leaked of_node references
	ASoC: Intel: kbl_da7219_max98357a: Map BTN_0 to KEY_PLAYPAUSE
	usb: dwc2: gadget: Increase descriptors count for ISOC's
	usb: dwc3: move synchronize_irq() out of the spinlock protected block
	ASoC: hdmi-codec: unlock the device on startup errors
	powerpc/perf: Return accordingly on invalid chip-id in
	powerpc/boot: Fix missing check of lseek() return value
	powerpc/perf: Fix loop exit condition in nest_imc_event_init
	ASoC: imx: fix fiq dependencies
	spi: pxa2xx: fix SCR (divisor) calculation
	brcm80211: potential NULL dereference in brcmf_cfg80211_vndr_cmds_dcmd_handler()
	ACPI / property: fix handling of data_nodes in acpi_get_next_subnode()
	drm/nouveau/bar/nv50: ensure BAR is mapped
	media: stm32-dcmi: return appropriate error codes during probe
	ARM: vdso: Remove dependency with the arch_timer driver internals
	arm64: Fix compiler warning from pte_unmap() with -Wunused-but-set-variable
	powerpc/watchdog: Use hrtimers for per-CPU heartbeat
	sched/cpufreq: Fix kobject memleak
	scsi: qla2xxx: Fix a qla24xx_enable_msix() error path
	scsi: qla2xxx: Fix abort handling in tcm_qla2xxx_write_pending()
	scsi: qla2xxx: Avoid that lockdep complains about unsafe locking in tcm_qla2xxx_close_session()
	scsi: qla2xxx: Fix hardirq-unsafe locking
	x86/modules: Avoid breaking W^X while loading modules
	Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota reserve
	btrfs: fix panic during relocation after ENOSPC before writeback happens
	btrfs: Don't panic when we can't find a root key
	iwlwifi: pcie: don't crash on invalid RX interrupt
	rtc: 88pm860x: prevent use-after-free on device remove
	rtc: stm32: manage the get_irq probe defer case
	scsi: qedi: Abort ep termination if offload not scheduled
	s390/kexec_file: Fix detection of text segment in ELF loader
	sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs
	w1: fix the resume command API
	s390: qeth: address type mismatch warning
	dmaengine: pl330: _stop: clear interrupt status
	mac80211/cfg80211: update bss channel on channel switch
	libbpf: fix samples/bpf build failure due to undefined UINT32_MAX
	slimbus: fix a potential NULL pointer dereference in of_qcom_slim_ngd_register
	ASoC: fsl_sai: Update is_slave_mode with correct value
	mwifiex: prevent an array overflow
	rsi: Fix NULL pointer dereference in kmalloc
	net: cw1200: fix a NULL pointer dereference
	nvme: set 0 capacity if namespace block size exceeds PAGE_SIZE
	nvme-rdma: fix a NULL deref when an admin connect times out
	crypto: sun4i-ss - Fix invalid calculation of hash end
	bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set
	bcache: return error immediately in bch_journal_replay()
	bcache: fix failure in journal relplay
	bcache: add failure check to run_cache_set() for journal replay
	bcache: avoid clang -Wunintialized warning
	RDMA/cma: Consider scope_id while binding to ipv6 ll address
	vfio-ccw: Do not call flush_workqueue while holding the spinlock
	vfio-ccw: Release any channel program when releasing/removing vfio-ccw mdev
	x86/build: Move _etext to actual end of .text
	smpboot: Place the __percpu annotation correctly
	x86/mm: Remove in_nmi() warning from 64-bit implementation of vmalloc_fault()
	mm/uaccess: Use 'unsigned long' to placate UBSAN warnings on older GCC versions
	Bluetooth: hci_qca: Give enough time to ROME controller to bootup.
	HID: logitech-hidpp: use RAP instead of FAP to get the protocol version
	pinctrl: pistachio: fix leaked of_node references
	pinctrl: samsung: fix leaked of_node references
	clk: rockchip: undo several noc and special clocks as critical on rk3288
	perf/arm-cci: Remove broken race mitigation
	dmaengine: at_xdmac: remove BUG_ON macro in tasklet
	media: coda: clear error return value before picture run
	media: ov6650: Move v4l2_clk_get() to ov6650_video_probe() helper
	media: au0828: stop video streaming only when last user stops
	media: ov2659: make S_FMT succeed even if requested format doesn't match
	audit: fix a memory leak bug
	media: stm32-dcmi: fix crash when subdev do not expose any formats
	media: au0828: Fix NULL pointer dereference in au0828_analog_stream_enable()
	media: pvrusb2: Prevent a buffer overflow
	iio: adc: stm32-dfsdm: fix unmet direct dependencies detected
	block: fix use-after-free on gendisk
	powerpc/numa: improve control of topology updates
	powerpc/64: Fix booting large kernels with STRICT_KERNEL_RWX
	random: fix CRNG initialization when random.trust_cpu=1
	random: add a spinlock_t to struct batched_entropy
	cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
	sched/core: Check quota and period overflow at usec to nsec conversion
	sched/rt: Check integer overflow at usec to nsec conversion
	sched/core: Handle overflow in cpu_shares_write_u64
	staging: vc04_services: handle kzalloc failure
	drm/msm: a5xx: fix possible object reference leak
	irq_work: Do not raise an IPI when queueing work on the local CPU
	thunderbolt: Take domain lock in switch sysfs attribute callbacks
	s390/qeth: handle error from qeth_update_from_chp_desc()
	USB: core: Don't unbind interfaces following device reset failure
	x86/irq/64: Limit IST stack overflow check to #DB stack
	drm: etnaviv: avoid DMA API warning when importing buffers
	phy: sun4i-usb: Make sure to disable PHY0 passby for peripheral mode
	phy: mapphone-mdm6600: add gpiolib dependency
	i40e: Able to add up to 16 MAC filters on an untrusted VF
	i40e: don't allow changes to HW VLAN stripping on active port VLANs
	ACPI/IORT: Reject platform device creation on NUMA node mapping failure
	arm64: vdso: Fix clock_getres() for CLOCK_REALTIME
	RDMA/cxgb4: Fix null pointer dereference on alloc_skb failure
	perf/x86/msr: Add Icelake support
	perf/x86/intel/rapl: Add Icelake support
	perf/x86/intel/cstate: Add Icelake support
	hwmon: (vt1211) Use request_muxed_region for Super-IO accesses
	hwmon: (smsc47m1) Use request_muxed_region for Super-IO accesses
	hwmon: (smsc47b397) Use request_muxed_region for Super-IO accesses
	hwmon: (pc87427) Use request_muxed_region for Super-IO accesses
	hwmon: (f71805f) Use request_muxed_region for Super-IO accesses
	scsi: libsas: Do discovery on empty PHY to update PHY info
	mmc: core: make pwrseq_emmc (partially) support sleepy GPIO controllers
	mmc_spi: add a status check for spi_sync_locked
	mmc: sdhci-of-esdhc: add erratum eSDHC5 support
	mmc: sdhci-of-esdhc: add erratum A-009204 support
	mmc: sdhci-of-esdhc: add erratum eSDHC-A001 and A-008358 support
	drm/amdgpu: fix old fence check in amdgpu_fence_emit
	PM / core: Propagate dev->power.wakeup_path when no callbacks
	clk: rockchip: Fix video codec clocks on rk3288
	extcon: arizona: Disable mic detect if running when driver is removed
	clk: rockchip: Make rkpwm a critical clock on rk3288
	s390: zcrypt: initialize variables before_use
	x86/microcode: Fix the ancient deprecated microcode loading method
	s390/mm: silence compiler warning when compiling without CONFIG_PGSTE
	s390: cio: fix cio_irb declaration
	selftests: cgroup: fix cleanup path in test_memcg_subtree_control()
	qmi_wwan: Add quirk for Quectel dynamic config
	cpufreq: ppc_cbe: fix possible object reference leak
	cpufreq/pasemi: fix possible object reference leak
	cpufreq: pmac32: fix possible object reference leak
	cpufreq: kirkwood: fix possible object reference leak
	block: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBR
	x86/build: Keep local relocations with ld.lld
	drm/pl111: fix possible object reference leak
	iio: ad_sigma_delta: Properly handle SPI bus locking vs CS assertion
	iio: hmc5843: fix potential NULL pointer dereferences
	iio: common: ssp_sensors: Initialize calculated_time in ssp_common_process_data
	iio: adc: ti-ads7950: Fix improper use of mlock
	selftests/bpf: ksym_search won't check symbols exists
	rtlwifi: fix a potential NULL pointer dereference
	mwifiex: Fix mem leak in mwifiex_tm_cmd
	brcmfmac: fix missing checks for kmemdup
	b43: shut up clang -Wuninitialized variable warning
	brcmfmac: convert dev_init_lock mutex to completion
	brcmfmac: fix WARNING during USB disconnect in case of unempty psq
	brcmfmac: fix race during disconnect when USB completion is in progress
	brcmfmac: fix Oops when bringing up interface during USB disconnect
	rtc: xgene: fix possible race condition
	rtlwifi: fix potential NULL pointer dereference
	scsi: ufs: Fix regulator load and icc-level configuration
	scsi: ufs: Avoid configuring regulator with undefined voltage range
	drm/panel: otm8009a: Add delay at the end of initialization
	arm64: cpu_ops: fix a leaked reference by adding missing of_node_put
	wil6210: fix return code of wmi_mgmt_tx and wmi_mgmt_tx_ext
	x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP
	x86/uaccess, signal: Fix AC=1 bloat
	x86/ia32: Fix ia32_restore_sigcontext() AC leak
	x86/uaccess: Fix up the fixup
	chardev: add additional check for minor range overlap
	RDMA/hns: Fix bad endianess of port_pd variable
	sh: sh7786: Add explicit I/O cast to sh7786_mm_sel()
	HID: core: move Usage Page concatenation to Main item
	ASoC: eukrea-tlv320: fix a leaked reference by adding missing of_node_put
	ASoC: fsl_utils: fix a leaked reference by adding missing of_node_put
	cxgb3/l2t: Fix undefined behaviour
	HID: logitech-hidpp: change low battery level threshold from 31 to 30 percent
	spi: tegra114: reset controller on probe
	kobject: Don't trigger kobject_uevent(KOBJ_REMOVE) twice.
	media: video-mux: fix null pointer dereferences
	media: wl128x: prevent two potential buffer overflows
	media: gspca: Kill URBs on USB device disconnect
	efifb: Omit memory map check on legacy boot
	thunderbolt: property: Fix a missing check of kzalloc
	thunderbolt: Fix to check the return value of kmemdup
	timekeeping: Force upper bound for setting CLOCK_REALTIME
	scsi: qedf: Add missing return in qedf_post_io_req() in the fcport offload check
	virtio_console: initialize vtermno value for ports
	tty: ipwireless: fix missing checks for ioremap
	overflow: Fix -Wtype-limits compilation warnings
	x86/mce: Fix machine_check_poll() tests for error types
	rcutorture: Fix cleanup path for invalid torture_type strings
	x86/mce: Handle varying MCA bank counts
	rcuperf: Fix cleanup path for invalid perf_type strings
	usb: core: Add PM runtime calls to usb_hcd_platform_shutdown
	scsi: qla4xxx: avoid freeing unallocated dma memory
	scsi: lpfc: avoid uninitialized variable warning
	selinux: avoid uninitialized variable warning
	batman-adv: allow updating DAT entry timeouts on incoming ARP Replies
	dmaengine: tegra210-adma: use devm_clk_*() helpers
	hwrng: omap - Set default quality
	thunderbolt: Fix to check return value of ida_simple_get
	thunderbolt: Fix to check for kmemdup failure
	drm/amd/display: fix releasing planes when exiting odm
	thunderbolt: property: Fix a NULL pointer dereference
	e1000e: Disable runtime PM on CNP+
	tinydrm/mipi-dbi: Use dma-safe buffers for all SPI transfers
	igb: Exclude device from suspend direct complete optimization
	media: si2165: fix a missing check of return value
	media: dvbsky: Avoid leaking dvb frontend
	media: m88ds3103: serialize reset messages in m88ds3103_set_frontend
	media: staging: davinci_vpfe: disallow building with COMPILE_TEST
	drm/amd/display: Fix Divide by 0 in memory calculations
	drm/amd/display: Set stream->mode_changed when connectors change
	scsi: ufs: fix a missing check of devm_reset_control_get
	media: vimc: stream: fix thread state before sleep
	media: gspca: do not resubmit URBs when streaming has stopped
	media: go7007: avoid clang frame overflow warning with KASAN
	media: vimc: zero the media_device on probe
	scsi: lpfc: Fix FDMI manufacturer attribute value
	scsi: lpfc: Fix fc4type information for FDMI
	media: saa7146: avoid high stack usage with clang
	scsi: lpfc: Fix SLI3 commands being issued on SLI4 devices
	spi : spi-topcliff-pch: Fix to handle empty DMA buffers
	drm/omap: dsi: Fix PM for display blank with paired dss_pll calls
	spi: rspi: Fix sequencer reset during initialization
	spi: imx: stop buffer overflow in RX FIFO flush
	spi: Fix zero length xfer bug
	ASoC: davinci-mcasp: Fix clang warning without CONFIG_PM
	drm/v3d: Handle errors from IRQ setup.
	drm/drv: Hold ref on parent device during drm_device lifetime
	drm: Wake up next in drm_read() chain if we are forced to putback the event
	drm/sun4i: dsi: Change the start delay calculation
	vfio-ccw: Prevent quiesce function going into an infinite loop
	drm/sun4i: dsi: Enforce boundaries on the start delay
	NFS: Fix a double unlock from nfs_match,get_client
	Linux 4.19.47

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-05-31 08:14:29 -07:00
Roman Gushchin
4e4d5cea79 cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
[ Upstream commit 4dcabece4c3a9f9522127be12cc12cc120399b2f ]

The number of descendant cgroups and the number of dying
descendant cgroups are currently synchronized using the cgroup_mutex.

The number of descendant cgroups will be required by the cgroup v2
freezer, which will use it to determine if a cgroup is frozen
(depending on total number of descendants and number of frozen
descendants). It's not always acceptable to grab the cgroup_mutex,
especially from quite hot paths (e.g. exit()).

To avoid this, let's additionally synchronize these counters using
the css_set_lock.

So, it's safe to read these counters with either cgroup_mutex or
css_set_lock locked, and for changing both locks should be acquired.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31 06:46:19 -07:00
Greg Kroah-Hartman
d885da678e Merge 4.19.34 into android-4.19
Changes in 4.19.34
	arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals
	ext4: cleanup bh release code in ext4_ind_remove_space()
	tty/serial: atmel: Add is_half_duplex helper
	tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped
	CIFS: fix POSIX lock leak and invalid ptr deref
	h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux-
	f2fs: fix to adapt small inline xattr space in __find_inline_xattr()
	f2fs: fix to avoid deadlock in f2fs_read_inline_dir()
	tracing: kdb: Fix ftdump to not sleep
	net/mlx5: Avoid panic when setting vport rate
	net/mlx5: Avoid panic when setting vport mac, getting vport config
	gpio: gpio-omap: fix level interrupt idling
	include/linux/relay.h: fix percpu annotation in struct rchan
	sysctl: handle overflow for file-max
	net: stmmac: Avoid sometimes uninitialized Clang warnings
	enic: fix build warning without CONFIG_CPUMASK_OFFSTACK
	libbpf: force fixdep compilation at the start of the build
	scsi: hisi_sas: Set PHY linkrate when disconnected
	scsi: hisi_sas: Fix a timeout race of driver internal and SMP IO
	iio: adc: fix warning in Qualcomm PM8xxx HK/XOADC driver
	x86/hyperv: Fix kernel panic when kexec on HyperV
	perf c2c: Fix c2c report for empty numa node
	mm/sparse: fix a bad comparison
	mm/cma.c: cma_declare_contiguous: correct err handling
	mm/page_ext.c: fix an imbalance with kmemleak
	mm, swap: bounds check swap_info array accesses to avoid NULL derefs
	mm,oom: don't kill global init via memory.oom.group
	memcg: killed threads should not invoke memcg OOM killer
	mm, mempolicy: fix uninit memory access
	mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
	mm/slab.c: kmemleak no scan alien caches
	ocfs2: fix a panic problem caused by o2cb_ctl
	f2fs: do not use mutex lock in atomic context
	fs/file.c: initialize init_files.resize_wait
	page_poison: play nicely with KASAN
	cifs: use correct format characters
	dm thin: add sanity checks to thin-pool and external snapshot creation
	f2fs: fix to check inline_xattr_size boundary correctly
	cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED
	cifs: Fix NULL pointer dereference of devname
	netfilter: nf_tables: check the result of dereferencing base_chain->stats
	netfilter: conntrack: tcp: only close if RST matches exact sequence
	jbd2: fix invalid descriptor block checksum
	fs: fix guard_bio_eod to check for real EOD errors
	tools lib traceevent: Fix buffer overflow in arg_eval
	PCI/PME: Fix hotplug/sysfs remove deadlock in pcie_pme_remove()
	wil6210: check null pointer in _wil_cfg80211_merge_extra_ies
	mt76: fix a leaked reference by adding a missing of_node_put
	crypto: crypto4xx - add missing of_node_put after of_device_is_available
	crypto: cavium/zip - fix collision with generic cra_driver_name
	usb: chipidea: Grab the (legacy) USB PHY by phandle first
	powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables
	scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c
	kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing
	powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc
	coresight: etm4x: Add support to enable ETMv4.2
	serial: 8250_pxa: honor the port number from devicetree
	ARM: 8840/1: use a raw_spinlock_t in unwind
	iommu/io-pgtable-arm-v7s: Only kmemleak_ignore L2 tables
	powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback
	btrfs: qgroup: Make qgroup async transaction commit more aggressive
	mmc: omap: fix the maximum timeout setting
	net: dsa: mv88e6xxx: Add lockdep classes to fix false positive splat
	e1000e: Fix -Wformat-truncation warnings
	mlxsw: spectrum: Avoid -Wformat-truncation warnings
	platform/x86: ideapad-laptop: Fix no_hw_rfkill_list for Lenovo RESCUER R720-15IKBN
	platform/mellanox: mlxreg-hotplug: Fix KASAN warning
	loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
	IB/mlx4: Increase the timeout for CM cache
	clk: fractional-divider: check parent rate only if flag is set
	perf annotate: Fix getting source line failure
	ASoC: qcom: Fix of-node refcount unbalance in qcom_snd_parse_of()
	cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies
	efi: cper: Fix possible out-of-bounds access
	s390/ism: ignore some errors during deregistration
	scsi: megaraid_sas: return error when create DMA pool failed
	scsi: fcoe: make use of fip_mode enum complete
	drm/amd/display: Clear stream->mode_changed after commit
	perf test: Fix failure of 'evsel-tp-sched' test on s390
	mwifiex: don't advertise IBSS features without FW support
	perf report: Don't shadow inlined symbol with different addr range
	SoC: imx-sgtl5000: add missing put_device()
	media: ov7740: fix runtime pm initialization
	media: sh_veu: Correct return type for mem2mem buffer helpers
	media: s5p-jpeg: Correct return type for mem2mem buffer helpers
	media: rockchip/rga: Correct return type for mem2mem buffer helpers
	media: s5p-g2d: Correct return type for mem2mem buffer helpers
	media: mx2_emmaprp: Correct return type for mem2mem buffer helpers
	media: mtk-jpeg: Correct return type for mem2mem buffer helpers
	mt76: usb: do not run mt76u_queues_deinit twice
	xen/gntdev: Do not destroy context while dma-bufs are in use
	vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
	HID: intel-ish-hid: avoid binding wrong ishtp_cl_device
	cgroup, rstat: Don't flush subtree root unless necessary
	jbd2: fix race when writing superblock
	leds: lp55xx: fix null deref on firmware load failure
	perf report: Add s390 diagnosic sampling descriptor size
	iwlwifi: pcie: fix emergency path
	ACPI / video: Refactor and fix dmi_is_desktop()
	selftests: skip seccomp get_metadata test if not real root
	kprobes: Prohibit probing on bsearch()
	kprobes: Prohibit probing on RCU debug routine
	netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm
	ARM: 8833/1: Ensure that NEON code always compiles with Clang
	ARM: dts: meson8b: fix the Ethernet data line signals in eth_rgmii_pins
	ALSA: PCM: check if ops are defined before suspending PCM
	ath10k: fix shadow register implementation for WCN3990
	usb: f_fs: Avoid crash due to out-of-scope stack ptr access
	sched/topology: Fix percpu data types in struct sd_data & struct s_data
	bcache: fix input overflow to cache set sysfs file io_error_halflife
	bcache: fix input overflow to sequential_cutoff
	bcache: fix potential div-zero error of writeback_rate_i_term_inverse
	bcache: improve sysfs_strtoul_clamp()
	genirq: Avoid summation loops for /proc/stat
	net: marvell: mvpp2: fix stuck in-band SGMII negotiation
	iw_cxgb4: fix srqidx leak during connection abort
	net: phy: consider latched link-down status in polling mode
	fbdev: fbmem: fix memory access if logo is bigger than the screen
	cdrom: Fix race condition in cdrom_sysctl_register
	drm: rcar-du: add missing of_node_put
	drm/amd/display: Don't re-program planes for DPMS changes
	drm/amd/display: Disconnect mpcc when changing tg
	perf/aux: Make perf_event accessible to setup_aux()
	e1000e: fix cyclic resets at link up with active tx
	e1000e: Exclude device from suspend direct complete optimization
	platform/x86: intel_pmc_core: Fix PCH IP sts reading
	i2c: of: Try to find an I2C adapter matching the parent
	staging: spi: mt7621: Add return code check on device_reset()
	iwlwifi: mvm: fix RFH config command with >=10 CPUs
	ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe
	sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
	efi/memattr: Don't bail on zero VA if it equals the region's PA
	sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
	drm/vkms: Bugfix extra vblank frame
	ARM: dts: lpc32xx: Remove leading 0x and 0s from bindings notation
	efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted
	soc: qcom: gsbi: Fix error handling in gsbi_probe()
	mt7601u: bump supported EEPROM version
	ARM: 8830/1: NOMMU: Toggle only bits in EXC_RETURN we are really care of
	ARM: avoid Cortex-A9 livelock on tight dmb loops
	block, bfq: fix in-service-queue check for queue merging
	bpf: fix missing prototype warnings
	selftests/bpf: skip verifier tests for unsupported program types
	powerpc/64s: Clear on-stack exception marker upon exception return
	cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
	backlight: pwm_bl: Use gpiod_get_value_cansleep() to get initial state
	tty: increase the default flip buffer limit to 2*640K
	powerpc/pseries: Perform full re-add of CPU for topology update post-migration
	drm/amd/display: Enable vblank interrupt during CRC capture
	ALSA: dice: add support for Solid State Logic Duende Classic/Mini
	usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded
	platform/x86: intel-hid: Missing power button release on some Dell models
	perf script python: Use PyBytes for attr in trace-event-python
	perf script python: Add trace_context extension module to sys.modules
	media: mt9m111: set initial frame size other than 0x0
	hwrng: virtio - Avoid repeated init of completion
	soc/tegra: fuse: Fix illegal free of IO base address
	HID: intel-ish: ipc: handle PIMR before ish_wakeup also clear PISR busy_clear bit
	f2fs: UBSAN: set boolean value iostat_enable correctly
	hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable
	cpu/hotplug: Mute hotplug lockdep during init
	dmaengine: imx-dma: fix warning comparison of distinct pointer types
	dmaengine: qcom_hidma: assign channel cookie correctly
	dmaengine: qcom_hidma: initialize tx flags in hidma_prep_dma_*
	netfilter: physdev: relax br_netfilter dependency
	media: rcar-vin: Allow independent VIN link enablement
	media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration
	regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting
	pinctrl: meson: meson8b: add the eth_rxd2 and eth_rxd3 pins
	drm: Auto-set allow_fb_modifiers when given modifiers at plane init
	drm/nouveau: Stop using drm_crtc_force_disable
	x86/build: Specify elf_i386 linker emulation explicitly for i386 objects
	selinux: do not override context on context mounts
	brcmfmac: Use firmware_request_nowarn for the clm_blob
	wlcore: Fix memory leak in case wl12xx_fetch_firmware failure
	x86/build: Mark per-CPU symbols as absolute explicitly for LLD
	drm/fb-helper: fix leaks in error path of drm_fb_helper_fbdev_setup
	clk: meson: clean-up clock registration
	clk: rockchip: fix frac settings of GPLL clock for rk3328
	dmaengine: tegra: avoid overflow of byte tracking
	Input: soc_button_array - fix mapping of the 5th GPIO in a PNP0C40 device
	drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers
	net: stmmac: Avoid one more sometimes uninitialized Clang warning
	ACPI / video: Extend chassis-type detection with a "Lunch Box" check
	bcache: fix potential div-zero error of writeback_rate_p_term_inverse
	kprobes/x86: Blacklist non-attachable interrupt functions
	Linux 4.19.34

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-04-05 22:43:09 +02:00
Oleg Nesterov
d0bc74c563 cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
[ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ]

The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
needs pids_free() to uncharge the pid.

However, ->free() is called from __put_task_struct()->cgroup_free() and this
is too late. Even the trivial program which does

	for (;;) {
		int pid = fork();
		assert(pid >= 0);
		if (pid)
			wait(NULL);
		else
			exit(0);
	}

can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
implies an RCU gp after the task/pid goes away and before the final put().

Test-case:

	mkdir -p /tmp/CG
	mount -t cgroup2 none /tmp/CG
	echo '+pids' > /tmp/CG/cgroup.subtree_control

	mkdir /tmp/CG/PID
	echo 2 > /tmp/CG/PID/pids.max

	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
	echo $! > /tmp/CG/PID/cgroup.procs

Without this patch the forking process fails soon after migration.

Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
into the new helper, cgroup_release(), called by release_task() which actually
frees the pid(s).

Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:33:13 +02:00
Johannes Weiner
9f79143ebb UPSTREAM: kernel: cgroup: add poll file operation
Cgroup has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes.  To allow polling for
custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.

Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

(cherry picked from commit: dc50537bdd1a0804fa2cbc990565ee9a944e66fa)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Idc648e7b7b7bd5fc00c7b32163e55a93b0f49a98
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:28 -07:00
Johannes Weiner
dc9cd29ded UPSTREAM: psi: cgroup support
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups.  In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.

Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 2ce7135adc9ad081aa3c49744144376ac74fea60)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I163e6657aaa60aa5aab9372616a3bce2a65e90ec
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Tejun Heo
479adb89a9 cgroup: Fix dom_cgrp propagation when enabling threaded mode
A cgroup which is already a threaded domain may be converted into a
threaded cgroup if the prerequisite conditions are met.  When this
happens, all threaded descendant should also have their ->dom_cgrp
updated to the new threaded domain cgroup.  Unfortunately, this
propagation was missing leading to the following failure.

  # cd /sys/fs/cgroup/unified
  # cat cgroup.subtree_control    # show that no controllers are enabled

  # mkdir -p mycgrp/a/b/c
  # echo threaded > mycgrp/a/b/cgroup.type

  At this point, the hierarchy looks as follows:

      mycgrp [d]
	  a [dt]
	      b [t]
		  c [inv]

  Now let's make node "a" threaded (and thus "mycgrp" s made "domain threaded"):

  # echo threaded > mycgrp/a/cgroup.type

  By this point, we now have a hierarchy that looks as follows:

      mycgrp [dt]
	  a [t]
	      b [t]
		  c [inv]

  But, when we try to convert the node "c" from "domain invalid" to
  "threaded", we get ENOTSUP on the write():

  # echo threaded > mycgrp/a/b/c/cgroup.type
  sh: echo: write error: Operation not supported

This patch fixes the problem by

* Moving the opencoded ->dom_cgrp save and restoration in
  cgroup_enable_threaded() into cgroup_{save|restore}_control() so
  that mulitple cgroups can be handled.

* Updating all threaded descendants' ->dom_cgrp to point to the new
  dom_cgrp when enabling threaded mode.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Reported-by: Amin Jamali <ajamali@pivotal.io>
Reported-by: Joao De Almeida Pereira <jpereira@pivotal.io>
Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
Fixes: 454000adaa ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
Cc: stable@vger.kernel.org # v4.14+
2018-10-04 13:28:08 -07:00
Josef Bacik
d09d8df3a2 blkcg: add generic throttling mechanism
Since IO can be issued from literally anywhere it's almost impossible to
do throttling without having some sort of adverse effect somewhere else
in the system because of locking or other dependencies.  The best way to
solve this is to do the throttling when we know we aren't holding any
other kernel resources.  Do this by tracking throttling in a per-blkg
basis, and if we require throttling flag the task that it needs to check
before it returns to user space and possibly sleep there.

This is to address the case where a process is doing work that is
generating IO that can't be throttled, whether that is directly with a
lot of REQ_META IO, or indirectly by allocating so much memory that it
is swamping the disk with REQ_SWAP.  We can't use task_add_work as we
don't want to induce a memory allocation in the IO path, so simply
saving the request queue in the task and flagging it to do the
notify_resume thing achieves the same result without the overhead of a
memory allocation.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Tejun Heo
8f53470bab cgroup: Add cgroup_subsys->css_rstat_flush()
This patch adds cgroup_subsys->css_rstat_flush().  If a subsystem has
this callback, its csses are linked on cgrp->css_rstat_list and rstat
will call the function whenever the associated cgroup is flushed.
Flush is also performed when such csses are released so that residual
counts aren't lost.

Combined with the rstat API previous patches factored out, this allows
controllers to plug into rstat to manage their statistics in a
scalable way.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:05 -07:00
Tejun Heo
d4ff749b5e cgroup: Distinguish base resource stat implementation from rstat
Base resource stat accounts universial (not specific to any
controller) resource consumptions on top of rstat.  Currently, its
implementation is intermixed with rstat implementation making the code
confusing to follow.

This patch clarifies the distintion by doing the followings.

* Encapsulate base resource stat counters, currently only cputime, in
  struct cgroup_base_stat.

* Move prev_cputime into struct cgroup and initialize it with cgroup.

* Rename the related functions so that they start with cgroup_base_stat.

* Prefix the related variables and field names with b.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:04 -07:00
Tejun Heo
c58632b363 cgroup: Rename stat to rstat
stat is too generic a name and ends up causing subtle confusions.
It'll be made generic so that controllers can plug into it, which will
make the problem worse.  Let's rename it to something more specific -
cgroup_rstat for cgroup recursive stat.

This patch does the following renames.  No other changes.

* cpu_stat	-> rstat_cpu
* stat		-> rstat
* ?cstat	-> ?rstatc

Note that the renames are selective.  The unrenamed are the ones which
implement basic resource statistics on top of rstat.  This will be
further cleaned up in the following patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:04 -07:00
Tejun Heo
b12e358328 cgroup: Limit event generation frequency
".events" files generate file modified event to notify userland of
possible new events.  Some of the events can be quite bursty
(e.g. memory high event) and generating notification each time is
costly and pointless.

This patch implements a event rate limit mechanism.  If a new
notification is requested before 10ms has passed since the previous
notification, the new notification is delayed till then.

As this only delays from the second notification on in a given close
cluster of notifications, userland reactions to notifications
shouldn't be delayed at all in most cases while avoiding notification
storms.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-04-26 14:29:04 -07:00
Linus Torvalds
d92cd810e6 Merge branch 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "rcu_work addition and a couple trivial changes"

* 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: remove the comment about the old manager_arb mutex
  workqueue: fix the comments of nr_idle
  fs/aio: Use rcu_work instead of explicit rcu and work item
  cgroup: Use rcu_work instead of explicit rcu and work item
  RCU, workqueue: Implement rcu_work
2018-04-03 18:00:13 -07:00
Tejun Heo
8f36aaec9c cgroup: Use rcu_work instead of explicit rcu and work item
Workqueue now has rcu_work.  Use it instead of open-coding rcu -> work
item bouncing.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-03-19 10:12:03 -07:00
Eric Dumazet
4dcb31d464 net: use skb_to_full_sk() in skb_update_prio()
Andrei Vagin reported a KASAN: slab-out-of-bounds error in
skb_update_prio()

Since SYNACK might be attached to a request socket, we need to
get back to the listener socket.
Since this listener is manipulated without locks, add const
qualifiers to sock_cgroup_prioidx() so that the const can also
be used in skb_update_prio()

Also add the const qualifier to sock_cgroup_classid() for consistency.

Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-14 12:53:23 -04:00
Matt Roper
392536b731 cgroup: Update documentation reference
The cgroup_subsys structure references a documentation file that has been
renamed after the v1/v2 split.  Since the v2 documentation doesn't
currently contain any information on kernel interfaces for controllers,
point the user to the v1 docs.

Cc: Tejun Heo <tj@kernel.org>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-01-02 06:59:52 -08:00
Linus Torvalds
22714a2ba4 Merge branch 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Cgroup2 cpu controller support is finally merged.

   - Basic cpu statistics support to allow monitoring by default without
     the CPU controller enabled.

   - cgroup2 cpu controller support.

   - /sys/kernel/cgroup files to help dealing with new / optional
     features"

* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: export list of cgroups v2 features using sysfs
  cgroup: export list of delegatable control files using sysfs
  cgroup: mark @cgrp __maybe_unused in cpu_stat_show()
  MAINTAINERS: relocate cpuset.c
  cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
  sched: Implement interface for cgroup unified hierarchy
  sched: Misc preps for cgroup unified hierarchy interface
  sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
  cgroup: statically initialize init_css_set->dfl_cgrp
  cgroup: Implement cgroup2 basic CPU usage accounting
  cpuacct: Introduce cgroup_account_cputime[_field]()
  sched/cputime: Expose cputime_adjust()
2017-11-15 14:29:44 -08:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Tejun Heo
d41bf8c9de cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
The basic cpu stat is currently shown with "cpu." prefix in
cgroup.stat, and the same information is duplicated in cpu.stat when
cpu controller is enabled.  This is ugly and not very scalable as we
want to expand the coverage of stat information which is always
available.

This patch makes cgroup core always create "cpu.stat" file and show
the basic cpu stat there and calls the cpu controller to show the
extra stats when enabled.  This ensures that the same information
isn't presented in multiple places and makes future expansion of basic
stats easier.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2017-10-26 10:56:33 -07:00
Tejun Heo
041cd640b2 cgroup: Implement cgroup2 basic CPU usage accounting
In cgroup1, while cpuacct isn't actually controlling any resources, it
is a separate controller due to combination of two factors -
1. enabling cpu controller has significant side effects, and 2. we
have to pick one of the hierarchies to account CPU usages on.  cpuacct
controller is effectively used to designate a hierarchy to track CPU
usages on.

cgroup2's unified hierarchy removes the second reason and we can
account basic CPU usages by default.  While we can use cpuacct for
this purpose, both its interface and implementation leave a lot to be
desired - it collects and exposes two sources of truth which don't
agree with each other and some of the exposed statistics don't make
much sense.  Also, it propagates all the way up the hierarchy on each
accounting event which is unnecessary.

This patch adds basic resource accounting mechanism to cgroup2's
unified hierarchy and accounts CPU usages using it.

* All accountings are done per-cpu and don't propagate immediately.
  It just bumps the per-cgroup per-cpu counters and links to the
  parent's updated list if not already on it.

* On a read, the per-cpu counters are collected into the global ones
  and then propagated upwards.  Only the per-cpu counters which have
  changed since the last read are propagated.

* CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
  prefix.  Total usage is collected from scheduling events.  User/sys
  breakdown is sourced from tick sampling and adjusted to the usage
  using cputime_adjust().

This keeps the accounting side hot path O(1) and per-cpu and the read
side O(nr_updated_since_last_read).

v2: Minor changes and documentation updates as suggested by Waiman and
    Roman.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
2017-09-25 08:12:05 -07:00
Waiman Long
e1cba4b85d cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup
A new mount option "cpuset_v2_mode" is added to the v1 cgroupfs
filesystem to enable cpuset controller to use v2 behavior in a v1
cgroup. This mount option applies only to cpuset controller and have
no effect on other controllers.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-08-18 08:24:21 -07:00
Roman Gushchin
1a926e0bba cgroup: implement hierarchy limits
Creating cgroup hierearchies of unreasonable size can affect
overall system performance. A user might want to limit the
size of cgroup hierarchy. This is especially important if a user
is delegating some cgroup sub-tree.

To address this issue, introduce an ability to control
the size of cgroup hierarchy.

The cgroup.max.descendants control file allows to set the maximum
allowed number of descendant cgroups.
The cgroup.max.depth file controls the maximum depth of the cgroup
tree. Both are single value r/w files, with "max" default value.

The control files exist on each hierarchy level (including root).
When a new cgroup is created, we check the total descendants
and depth limits on each level, and if none of them are exceeded,
a new cgroup is created.

Only alive cgroups are counted, removed (dying) cgroups are
ignored.

Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
2017-08-02 12:05:20 -07:00
Roman Gushchin
0679dee03c cgroup: keep track of number of descent cgroups
Keep track of the number of online and dying descent cgroups.

This data will be used later to add an ability to control cgroup
hierarchy (limit the depth and the number of descent cgroups)
and display hierarchy stats.

Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
2017-08-02 12:05:19 -07:00
Tejun Heo
8cfd8147df cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support.  The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.

A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file.  When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.

The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint.  Note that the threads aren't allowed
to escape to a different threaded subtree.  To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.

The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree.  This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted.  This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.

As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.

Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.

This patch enables threaded mode for the pids and perf_events
controllers.  Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.

For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.

v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
      Spotted by Waiman.
    - Documentation updated as suggested by Waiman.
    - cgroup.type content slightly reformatted.
    - Mark the debug controller threaded.

v4: - Updated to the general idea of marking specific cgroups
      domain/threaded as suggested by PeterZ.

v3: - Dropped "join" and always make mixed children join the parent's
      threaded subtree.

v2: - After discussions with Waiman, support for mixed thread mode is
      added.  This should address the issue that Peter pointed out
      where any nesting should be avoided for thread subtrees while
      coexisting with other domain cgroups.
    - Enabling / disabling thread mode now piggy backs on the existing
      control mask update mechanism.
    - Bug fixes and cleanup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 11:14:51 -04:00
Tejun Heo
454000adaa cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
cgroup v2 is in the process of growing thread granularity support.  A
threaded subtree is composed of a thread root and threaded cgroups
which are proper members of the subtree.

The root cgroup of the subtree serves as the domain cgroup to which
the processes (as opposed to threads / tasks) of the subtree
conceptually belong and domain-level resource consumptions not tied to
any specific task are charged.  Inside the subtree, threads won't be
subject to process granularity or no-internal-task constraint and can
be distributed arbitrarily across the subtree.

This patch introduces cgroup->dom_cgrp along with threaded css_set
handling.

* cgroup->dom_cgrp points to self for normal and thread roots.  For
  proper thread subtree members, points to the dom_cgrp (the thread
  root).

* css_set->dom_cset points to self if for normal and thread roots.  If
  threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
  The dom_cgrp serves as the resource domain and keeps the matching
  csses available.  The dom_cset holds those csses and makes them
  easily accessible.

* All threaded csets are linked on their dom_csets to enable iteration
  of all threaded tasks.

* cgroup->nr_threaded_children keeps track of the number of threaded
  children.

This patch adds the above but doesn't actually use them yet.  The
following patches will build on top.

v4: ->nr_threaded_children added.

v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset.  Updated for the new
    enable-threaded-per-cgroup behavior.

v2: Added cgroup_is_threaded() helper.

Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-21 11:14:51 -04:00
Tejun Heo
788b950c62 cgroup: distinguish local and children populated states
cgrp->populated_cnt counts both local (the cgroup's populated
css_sets) and subtree proper (populated children) so that it's only
zero when the whole subtree, including self, is empty.

This patch splits the counter into two so that local and children
populated states are tracked separately.  It allows finer-grained
tests on the state of the hierarchy which will be used to replace
css_set walking local populated test.

Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-16 21:44:42 -04:00
Tejun Heo
5136f6365c cgroup: implement "nsdelegate" mount option
Currently, cgroup only supports delegation to !root users and cgroup
namespaces don't get any special treatments.  This limits the
usefulness of cgroup namespaces as they by themselves can't be safe
delegation boundaries.  A process inside a cgroup can change the
resource control knobs of the parent in the namespace root and may
move processes in and out of the namespace if cgroups outside its
namespace are visible somehow.

This patch adds a new mount option "nsdelegate" which makes cgroup
namespaces delegation boundaries.  If set, cgroup behaves as if write
permission based delegation took place at namespace boundaries -
writes to the resource control knobs from the namespace root are
denied and migration crossing the namespace boundary aren't allowed
from inside the namespace.

This allows cgroup namespace to function as a delegation boundary by
itself.

v2: Silently ignore nsdelegate specified on !init mounts.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aravind Anbudurai <aru7@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>
2017-06-28 14:45:21 -04:00
Waiman Long
73a7242a06 cgroup: Keep accurate count of tasks in each css_set
The reference count in the css_set data structure was used as a
proxy of the number of tasks attached to that css_set. However, that
count is actually not an accurate measure especially with thread mode
support. So a new variable nr_tasks is added to the css_set to keep
track of the actual task count. This new variable is protected by
the css_set_lock. Functions that require the actual task count are
updated to use the new variable.

tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
    Refreshed on top of cgroup/for-v4.13 which dropped on
    css_set_populated() -> nr_tasks conversion.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-14 16:01:21 -04:00
Waiman Long
33c35aa481 cgroup: Prevent kill_css() from being called more than once
The kill_css() function may be called more than once under the condition
that the css was killed but not physically removed yet followed by the
removal of the cgroup that is hosting the css. This patch prevents any
harmm from being done when that happens.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.5+
2017-05-17 16:58:32 -04:00
Todd Poynor
b8b1a2e5ec cgroup: move cgroup_subsys_state parent field for cache locality
Various structures embed a struct cgroup_subsys_state, typically at
the top of the containing structure.  It is common for code that
accesses the structures to perform operations that iterate over the
chain of parent css pointers, also accessing data in each containing
structure.  In particular, struct cpuacct is used by fairly hot code
paths in the scheduler such as cpuacct_charge().

Move the parent css pointer field to the end of the structure to
increase the chances of residing in the same cache line as the data
from the containing structure.

Signed-off-by: Todd Poynor <toddpoynor@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-11 09:06:17 +09:00
Elena Reshetova
4b9502e63b kernel: convert css_set.refcount from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-08 17:46:03 -05:00
Ingo Molnar
780de9dd27 sched/headers, cgroups: Remove the threadgroup_change_*() wrappery
threadgroup_change_begin()/end() is a pointless wrapper around
cgroup_threadgroup_change_begin()/end(), minus a might_sleep()
in the !CONFIG_CGROUPS=y case.

Remove the wrappery, move the might_sleep() (the down_read()
already has a might_sleep() check).

This debloats <linux/sched.h> a bit and simplifies this API.

Update all call sites.

No change in functionality.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:25 +01:00
Tejun Heo
5f617ebbdf cgroup: reorder css_set fields
Reorder css_set fields so that they're roughly in the order of how hot
they are.  The rough order is

1. the actual csses
2. reference counter and the default cgroup pointer.
3. task lists and iterations
4. fields used during merge including css_set lookup
5. the rest

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:05 -05:00
Tejun Heo
e90cbebc3f cgroup add cftype->open/release() callbacks
Pipe the newly added kernfs->open/release() callbacks through cftype.
While at it, as cleanup operations now can be performed from
->release() instead of ->seq_stop(), make the latter optional.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
2016-12-27 14:49:03 -05:00
Daniel Mack
3007098494 cgroup: add support for eBPF programs
This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 16:25:52 -05:00
Tejun Heo
5cf1cacb49 cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback
Since e93ad19d05 ("cpuset: make mm migration asynchronous"), cpuset
kicks off asynchronous NUMA node migration if necessary during task
migration and flushes it from cpuset_post_attach_flush() which is
called at the end of __cgroup_procs_write().  This is to avoid
performing migration with cgroup_threadgroup_rwsem write-locked which
can lead to deadlock through dependency on kworker creation.

memcg has a similar issue with charge moving, so let's convert it to
an official callback rather than the current one-off cpuset specific
function.  This patch adds cgroup_subsys->post_attach callback and
makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

The conversion is mostly one-to-one except that the new callback is
called under cgroup_mutex.  This is to guarantee that no other
migration operations are started before ->post_attach callbacks are
finished.  cgroup_mutex is one of the outermost mutex in the system
and has never been and shouldn't be a problem.  We can add specialized
synchronization around __cgroup_procs_write() but I don't think
there's any noticeable benefit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
2016-04-25 15:45:14 -04:00
Tejun Heo
2b021cbf3c cgroup: ignore css_sets associated with dead cgroups during migration
Before 2e91fa7f6d ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks.  As init_css_set is always valid,
this worked fine.

However, after 2e91fa7f6d, zombie tasks stay with the css_set it was
associated with at the time of death.  Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2.  After T
becomes a zombie, it would still remain associated with A and B.  If A
only contains zombie tasks, it can be removed.  On removal, A gets
marked offline but stays pinned until all zombies are drained.  At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.

 WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
 CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
 ...
 Call Trace:
  [<ffffffff8127e63c>] dump_stack+0x4e/0x82
  [<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
  [<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
  [<ffffffff810c33e1>] cgroup_get+0x121/0x160
  [<ffffffff810c349b>] link_css_set+0x7b/0x90
  [<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
  [<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
  [<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
  [<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
  [<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
  [<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
  [<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
  [<ffffffff81151673>] __vfs_write+0x23/0xe0
  [<ffffffff81152494>] vfs_write+0xa4/0x1a0
  [<ffffffff811532d4>] SyS_write+0x44/0xa0
  [<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f

It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration.  This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 2e91fa7f6d ("cgroup: keep zombies associated with their original cgroups")
Cc: stable@vger.kernel.org # v4.4+
2016-03-16 13:31:46 -07:00
Tejun Heo
f6d635ad34 cgroup: implement cgroup_subsys->implicit_on_dfl
Some controllers, perf_event for now and possibly freezer in the
future, don't really make sense to control explicitly through
"cgroup.subtree_control".  For example, the primary role of perf_event
is identifying the cgroups of tasks; however, because the controller
also keeps a small amount of state per cgroup, it can't be replaced
with simple cgroup membership tests.

This patch implements cgroup_subsys->implicit_on_dfl flag.  When set,
the controller is implicitly enabled on all cgroups on the v2
hierarchy so that utility type controllers such as perf_event can be
enabled and function transparently.

An implicit controller doesn't show up in "cgroup.controllers" or
"cgroup.subtree_control", is exempt from no internal process rule and
can be stolen from the default hierarchy even if there are non-root
csses.

v2: Reimplemented on top of the recent updates to css handling and
    subsystem rebinding.  Rebinding implicit subsystems is now a
    simple matter of exempting it from the busy subsystem check.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:26 -05:00
Tejun Heo
e4857982f4 cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
Migration can be multi-target on the default hierarchy when a
controller is enabled - processes belonging to each child cgroup have
to be moved to the child cgroup itself to refresh css association.

This isn't a problem for cgroup_migrate_add_src() as each source
css_set still maps to single source and target cgroups; however,
cgroup_migrate_prepare_dst() is called once after all source css_sets
are added and thus might not have a single destination cgroup.  This
is currently worked around by specifying NULL for @dst_cgrp and using
the source's default cgroup as destination as the only multi-target
migration in use is self-targetting.  While this works, it's subtle
and clunky.

As all taget cgroups are already specified while preparing the source
css_sets, this clunkiness can easily be removed by recording the
target cgroup in each source css_set.  This patch adds
css_set->mg_dst_cgrp which is recorded on cgroup_migrate_src() and
used by cgroup_migrate_prepare_dst().  This also makes migration code
ready for arbitrary multi-target migration.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:26 -05:00