Changes in 4.19.325
netlink: terminate outstanding dump on socket close
ocfs2: uncache inode which has failed entering the group
nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
ocfs2: fix UBSAN warning in ocfs2_verify_volume()
nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
Revert "mmc: dw_mmc: Fix IDMAC operation with pages bigger than 4K"
media: dvbdev: fix the logic when DVB_DYNAMIC_MINORS is not set
kbuild: Use uname for LINUX_COMPILE_HOST detection
mm: revert "mm: shmem: fix data-race in shmem_getattr()"
ASoC: Intel: bytcr_rt5640: Add DMI quirk for Vexia Edu Atla 10 tablet
mac80211: fix user-power when emulating chanctx
selftests/watchdog-test: Fix system accidentally reset after watchdog-test
x86/amd_nb: Fix compile-testing without CONFIG_AMD_NB
net: usb: qmi_wwan: add Quectel RG650V
proc/softirqs: replace seq_printf with seq_put_decimal_ull_width
nvme: fix metadata handling in nvme-passthrough
initramfs: avoid filename buffer overrun
m68k: mvme147: Fix SCSI controller IRQ numbers
m68k: mvme16x: Add and use "mvme16x.h"
m68k: mvme147: Reinstate early console
acpi/arm64: Adjust error handling procedure in gtdt_parse_timer_block()
s390/syscalls: Avoid creation of arch/arch/ directory
hfsplus: don't query the device logical block size multiple times
EDAC/fsl_ddr: Fix bad bit shift operations
crypto: pcrypt - Call crypto layer directly when padata_do_parallel() return -EBUSY
crypto: cavium - Fix the if condition to exit loop after timeout
crypto: bcm - add error check in the ahash_hmac_init function
crypto: cavium - Fix an error handling path in cpt_ucode_load_fw()
time: Fix references to _msecs_to_jiffies() handling of values
soc: qcom: geni-se: fix array underflow in geni_se_clk_tbl_get()
mmc: mmc_spi: drop buggy snprintf()
ARM: dts: cubieboard4: Fix DCDC5 regulator constraints
regmap: irq: Set lockdep class for hierarchical IRQ domains
firmware: arm_scpi: Check the DVFS OPP count returned by the firmware
drm/mm: Mark drm_mm_interval_tree*() functions with __maybe_unused
wifi: ath9k: add range check for conn_rsp_epid in htc_connect_service()
drm/omap: Fix locking in omap_gem_new_dmabuf()
bpf: Fix the xdp_adjust_tail sample prog issue
wifi: mwifiex: Fix memcpy() field-spanning write warning in mwifiex_config_scan()
drm/etnaviv: consolidate hardware fence handling in etnaviv_gpu
drm/etnaviv: dump: fix sparse warnings
drm/etnaviv: fix power register offset on GC300
drm/etnaviv: hold GPU lock across perfmon sampling
net: rfkill: gpio: Add check for clk_enable()
ALSA: us122l: Use snd_card_free_when_closed() at disconnection
ALSA: caiaq: Use snd_card_free_when_closed() at disconnection
ALSA: 6fire: Release resources at card release
netpoll: Use rcu_access_pointer() in netpoll_poll_lock
trace/trace_event_perf: remove duplicate samples on the first tracepoint event
powerpc/vdso: Flag VDSO64 entry points as functions
mfd: da9052-spi: Change read-mask to write-mask
cpufreq: loongson2: Unregister platform_driver on failure
mtd: rawnand: atmel: Fix possible memory leak
RDMA/bnxt_re: Check cqe flags to know imm_data vs inv_irkey
mfd: rt5033: Fix missing regmap_del_irq_chip()
scsi: bfa: Fix use-after-free in bfad_im_module_exit()
scsi: fusion: Remove unused variable 'rc'
scsi: qedi: Fix a possible memory leak in qedi_alloc_and_init_sb()
ocfs2: fix uninitialized value in ocfs2_file_read_iter()
powerpc/sstep: make emulate_vsx_load and emulate_vsx_store static
fbdev/sh7760fb: Alloc DMA memory from hardware device
fbdev: sh7760fb: Fix a possible memory leak in sh7760fb_alloc_mem()
dt-bindings: clock: adi,axi-clkgen: convert old binding to yaml format
dt-bindings: clock: axi-clkgen: include AXI clk
clk: axi-clkgen: use devm_platform_ioremap_resource() short-hand
clk: clk-axi-clkgen: make sure to enable the AXI bus clock
perf probe: Correct demangled symbols in C++ program
PCI: cpqphp: Use PCI_POSSIBLE_ERROR() to check config reads
PCI: cpqphp: Fix PCIBIOS_* return value confusion
m68k: mcfgpio: Fix incorrect register offset for CONFIG_M5441x
m68k: coldfire/device.c: only build FEC when HW macros are defined
rpmsg: glink: Add TX_DATA_CONT command while sending
rpmsg: glink: Send READ_NOTIFY command in FIFO full case
rpmsg: glink: Fix GLINK command prefix
rpmsg: glink: use only lower 16-bits of param2 for CMD_OPEN name length
NFSD: Prevent NULL dereference in nfsd4_process_cb_update()
NFSD: Cap the number of bytes copied by nfs4_reset_recoverydir()
vfio/pci: Properly hide first-in-list PCIe extended capability
power: supply: core: Remove might_sleep() from power_supply_put()
net: usb: lan78xx: Fix memory leak on device unplug by freeing PHY device
tg3: Set coherent DMA mask bits to 31 for BCM57766 chipsets
net: usb: lan78xx: Fix refcounting and autosuspend on invalid WoL configuration
marvell: pxa168_eth: fix call balance of pep->clk handling routines
net: stmmac: dwmac-socfpga: Set RX watchdog interrupt as broken
usb: using mutex lock and supporting O_NONBLOCK flag in iowarrior_read()
USB: chaoskey: fail open after removal
USB: chaoskey: Fix possible deadlock chaoskey_list_lock
misc: apds990x: Fix missing pm_runtime_disable()
apparmor: fix 'Do simple duplicate message elimination'
usb: ehci-spear: fix call balance of sehci clk handling routines
ext4: supress data-race warnings in ext4_free_inodes_{count,set}()
ext4: fix FS_IOC_GETFSMAP handling
jfs: xattr: check invalid xattr size more strictly
ASoC: codecs: Fix atomicity violation in snd_soc_component_get_drvdata()
PCI: Fix use-after-free of slot->bus on hot remove
tty: ldsic: fix tty_ldisc_autoload sysctl's proc_handler
Bluetooth: Fix type of len in rfcomm_sock_getsockopt{,_old}()
ALSA: usb-audio: Fix potential out-of-bound accesses for Extigy and Mbox devices
Revert "usb: gadget: composite: fix OS descriptors w_value logic"
serial: sh-sci: Clean sci_ports[0] after at earlycon exit
Revert "serial: sh-sci: Clean sci_ports[0] after at earlycon exit"
netfilter: ipset: add missing range check in bitmap_ip_uadt
spi: Fix acpi deferred irq probe
ubi: wl: Put source PEB into correct list if trying locking LEB failed
um: ubd: Do not use drvdata in release
um: net: Do not use drvdata in release
serial: 8250: omap: Move pm_runtime_get_sync
um: vector: Do not use drvdata in release
sh: cpuinfo: Fix a warning for CONFIG_CPUMASK_OFFSTACK
arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
block: fix ordering between checking BLK_MQ_S_STOPPED request adding
HID: wacom: Interpret tilt data from Intuos Pro BT as signed values
media: wl128x: Fix atomicity violation in fmc_send_cmd()
usb: dwc3: gadget: Fix checking for number of TRBs left
lib: string_helpers: silence snprintf() output truncation warning
NFSD: Prevent a potential integer overflow
rpmsg: glink: Propagate TX failures in intentless mode as well
um: Fix the return value of elf_core_copy_task_fpregs
NFSv4.0: Fix a use-after-free problem in the asynchronous open()
rtc: check if __rtc_read_time was successful in rtc_timer_do_work()
ubifs: Correct the total block count by deducting journal reservation
ubi: fastmap: Fix duplicate slab cache names while attaching
jffs2: fix use of uninitialized variable
block: return unsigned int from bdev_io_min
9p/xen: fix init sequence
9p/xen: fix release of IRQ
modpost: remove incorrect code in do_eisa_entry()
sh: intc: Fix use-after-free bug in register_intc_controller()
Linux 4.19.325
Change-Id: I50250c8bd11f9ff4b40da75225c1cfb060e0c258
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 96a9fe64bfd486ebeeacf1e6011801ffe89dae18 upstream.
Supposing first scenario with a virtio_blk driver.
CPU0 CPU1
blk_mq_try_issue_directly()
__blk_mq_issue_directly()
q->mq_ops->queue_rq()
virtio_queue_rq()
blk_mq_stop_hw_queue()
virtblk_done()
blk_mq_request_bypass_insert() 1) store
blk_mq_start_stopped_hw_queue()
clear_bit(BLK_MQ_S_STOPPED) 3) store
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending()) 4) load
return
blk_mq_sched_dispatch_requests()
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending())
return
blk_mq_sched_dispatch_requests()
if (blk_mq_hctx_stopped()) 2) load
return
__blk_mq_sched_dispatch_requests()
Supposing another scenario.
CPU0 CPU1
blk_mq_requeue_work()
blk_mq_insert_request() 1) store
virtblk_done()
blk_mq_start_stopped_hw_queue()
blk_mq_run_hw_queues() clear_bit(BLK_MQ_S_STOPPED) 3) store
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending()) 4) load
return
blk_mq_sched_dispatch_requests()
if (blk_mq_hctx_stopped()) 2) load
continue
blk_mq_run_hw_queue()
Both scenarios are similar, the full memory barrier should be inserted
between 1) and 2), as well as between 3) and 4) to make sure that either
CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1 sees dispatch list.
Otherwise, either CPU will not rerun the hardware queue causing
starvation of the request.
The easy way to fix it is to add the essential full memory barrier into
helper of blk_mq_hctx_stopped(). In order to not affect the fast path
(hardware queue is not stopped most of the time), we only insert the
barrier into the slow path. Actually, only slow path needs to care about
missing of dispatching the request to the low-level device driver.
Fixes: 320ae51fee ("blk-mq: new multi-queue block IO queueing mechanism")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-4-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.19.307
PCI: mediatek: Clear interrupt status before dispatching handler
include/linux/units.h: add helpers for kelvin to/from Celsius conversion
units: Add Watt units
units: change from 'L' to 'UL'
units: add the HZ macros
serial: sc16is7xx: set safe default SPI clock frequency
driver core: add device probe log helper
spi: introduce SPI_MODE_X_MASK macro
serial: sc16is7xx: add check for unsupported SPI modes during probe
ext4: allow for the last group to be marked as trimmed
crypto: api - Disallow identical driver names
PM: hibernate: Enforce ordering during image compression/decompression
hwrng: core - Fix page fault dead lock on mmap-ed hwrng
rpmsg: virtio: Free driver_override when rpmsg_remove()
parisc/firmware: Fix F-extend for PDC addresses
nouveau/vmm: don't set addr on the fail path to avoid warning
block: Remove special-casing of compound pages
powerpc: Use always instead of always-y in for crtsavres.o
x86/CPU/AMD: Fix disabling XSAVES on AMD family 0x17 due to erratum
driver core: Annotate dev_err_probe() with __must_check
Revert "driver core: Annotate dev_err_probe() with __must_check"
driver code: print symbolic error code
drivers: core: fix kernel-doc markup for dev_err_probe()
net/smc: fix illegal rmb_desc access in SMC-D connection dump
vlan: skip nested type that is not IFLA_VLAN_QOS_MAPPING
llc: make llc_ui_sendmsg() more robust against bonding changes
llc: Drop support for ETH_P_TR_802_2.
net/rds: Fix UBSAN: array-index-out-of-bounds in rds_cmsg_recv
tracing: Ensure visibility when inserting an element into tracing_map
tcp: Add memory barrier to tcp_push()
netlink: fix potential sleeping issue in mqueue_flush_file
net/mlx5: Use kfree(ft->g) in arfs_create_groups()
net/mlx5e: fix a double-free in arfs_create_groups
netfilter: nf_tables: restrict anonymous set and map names to 16 bytes
fjes: fix memleaks in fjes_hw_setup
net: fec: fix the unhandled context fault from smmu
btrfs: don't warn if discard range is not aligned to sector
btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args
netfilter: nf_tables: reject QUEUE/DROP verdict parameters
gpiolib: acpi: Ignore touchpad wakeup on GPD G1619-04
drm: Don't unref the same fb many times by mistake due to deadlock handling
drm/bridge: nxp-ptn3460: fix i2c_master_send() error checking
drm/bridge: nxp-ptn3460: simplify some error checking
drm/exynos: gsc: minor fix for loop iteration in gsc_runtime_resume
gpio: eic-sprd: Clear interrupt after set the interrupt type
mips: Call lose_fpu(0) before initializing fcr31 in mips_set_personality_nan
tick/sched: Preserve number of idle sleeps across CPU hotplug events
x86/entry/ia32: Ensure s32 is sign extended to s64
net/sched: cbs: Fix not adding cbs instance to list
powerpc/mm: Fix null-pointer dereference in pgtable_cache_add
powerpc: Fix build error due to is_valid_bugaddr()
powerpc/mm: Fix build failures due to arch_reserved_kernel_pages()
powerpc/lib: Validate size for vector operations
audit: Send netlink ACK before setting connection in auditd_set
ACPI: video: Add quirk for the Colorful X15 AT 23 Laptop
PNP: ACPI: fix fortify warning
ACPI: extlog: fix NULL pointer dereference check
FS:JFS:UBSAN:array-index-out-of-bounds in dbAdjTree
UBSAN: array-index-out-of-bounds in dtSplitRoot
jfs: fix slab-out-of-bounds Read in dtSearch
jfs: fix array-index-out-of-bounds in dbAdjTree
jfs: fix uaf in jfs_evict_inode
pstore/ram: Fix crash when setting number of cpus to an odd number
crypto: stm32/crc32 - fix parsing list of devices
afs: fix the usage of read_seqbegin_or_lock() in afs_find_server*()
rxrpc_find_service_conn_rcu: fix the usage of read_seqbegin_or_lock()
jfs: fix array-index-out-of-bounds in diNewExt
s390/ptrace: handle setting of fpc register correctly
KVM: s390: fix setting of fpc register
SUNRPC: Fix a suspicious RCU usage warning
ext4: fix inconsistent between segment fstrim and full fstrim
ext4: unify the type of flexbg_size to unsigned int
ext4: remove unnecessary check from alloc_flex_gd()
ext4: avoid online resizing failures due to oversized flex bg
scsi: lpfc: Fix possible file string name overflow when updating firmware
PCI: Add no PM reset quirk for NVIDIA Spectrum devices
bonding: return -ENOMEM instead of BUG in alb_upper_dev_walk
ARM: dts: imx7s: Fix lcdif compatible
ARM: dts: imx7s: Fix nand-controller #size-cells
wifi: ath9k: Fix potential array-index-out-of-bounds read in ath9k_htc_txstatus()
bpf: Add map and need_defer parameters to .map_fd_put_ptr()
scsi: libfc: Don't schedule abort twice
scsi: libfc: Fix up timeout error in fc_fcp_rec_error()
ARM: dts: rockchip: fix rk3036 hdmi ports node
ARM: dts: imx25/27-eukrea: Fix RTC node name
ARM: dts: imx: Use flash@0,0 pattern
ARM: dts: imx27: Fix sram node
ARM: dts: imx1: Fix sram node
ARM: dts: imx27-apf27dev: Fix LED name
ARM: dts: imx23-sansa: Use preferred i2c-gpios properties
ARM: dts: imx23/28: Fix the DMA controller node name
md: Whenassemble the array, consult the superblock of the freshest device
wifi: rtl8xxxu: Add additional USB IDs for RTL8192EU devices
wifi: rtlwifi: rtl8723{be,ae}: using calculate_bit_shift()
wifi: cfg80211: free beacon_ies when overridden from hidden BSS
f2fs: fix to check return value of f2fs_reserve_new_block()
ASoC: doc: Fix undefined SND_SOC_DAPM_NOPM argument
fast_dput(): handle underflows gracefully
RDMA/IPoIB: Fix error code return in ipoib_mcast_join
drm/drm_file: fix use of uninitialized variable
drm/framebuffer: Fix use of uninitialized variable
drm/mipi-dsi: Fix detach call without attach
media: stk1160: Fixed high volume of stk1160_dbg messages
media: rockchip: rga: fix swizzling for RGB formats
PCI: add INTEL_HDA_ARL to pci_ids.h
ALSA: hda: Intel: add HDA_ARL PCI ID support
drm/exynos: Call drm_atomic_helper_shutdown() at shutdown/unbind time
IB/ipoib: Fix mcast list locking
media: ddbridge: fix an error code problem in ddb_probe
drm/msm/dpu: Ratelimit framedone timeout msgs
clk: hi3620: Fix memory leak in hi3620_mmc_clk_init()
clk: mmp: pxa168: Fix memory leak in pxa168_clk_init()
drm/amdgpu: Let KFD sync with VM fences
drm/amdgpu: Drop 'fence' check in 'to_amdgpu_amdkfd_fence()'
leds: trigger: panic: Don't register panic notifier if creating the trigger failed
um: Fix naming clash between UML and scheduler
um: Don't use vfprintf() for os_info()
um: net: Fix return type of uml_net_start_xmit()
mfd: ti_am335x_tscadc: Fix TI SoC dependencies
PCI: Only override AMD USB controller if required
usb: hub: Replace hardcoded quirk value with BIT() macro
libsubcmd: Fix memory leak in uniq()
virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings
blk-mq: fix IO hang from sbitmap wakeup race
ceph: fix deadlock or deadcode of misusing dget()
drm/amdgpu: Release 'adev->pm.fw' before return in 'amdgpu_device_need_post()'
wifi: cfg80211: fix RCU dereference in __cfg80211_bss_update
scsi: isci: Fix an error code problem in isci_io_request_build()
net: remove unneeded break
ixgbe: Remove non-inclusive language
ixgbe: Refactor returning internal error codes
ixgbe: Refactor overtemp event handling
ixgbe: Fix an error handling path in ixgbe_read_iosf_sb_reg_x550()
ipv6: Ensure natural alignment of const ipv6 loopback and router addresses
llc: call sock_orphan() at release time
netfilter: nf_log: replace BUG_ON by WARN_ON_ONCE when putting logger
net: ipv4: fix a memleak in ip_setup_cork
af_unix: fix lockdep positive in sk_diag_dump_icons()
net: sysfs: Fix /sys/class/net/<iface> path
HID: apple: Add support for the 2021 Magic Keyboard
HID: apple: Swap the Fn and Left Control keys on Apple keyboards
HID: apple: Add 2021 magic keyboard FN key mapping
bonding: remove print in bond_verify_device_path
dmaengine: fix is_slave_direction() return false when DMA_DEV_TO_DEV
phy: ti: phy-omap-usb2: Fix NULL pointer dereference for SRP
atm: idt77252: fix a memleak in open_card_ubr0
hwmon: (aspeed-pwm-tacho) mutex for tach reading
hwmon: (coretemp) Fix out-of-bounds memory access
hwmon: (coretemp) Fix bogus core_id to attr name mapping
inet: read sk->sk_family once in inet_recv_error()
rxrpc: Fix response to PING RESPONSE ACKs to a dead call
tipc: Check the bearer type before calling tipc_udp_nl_bearer_add()
ppp_async: limit MRU to 64K
netfilter: nft_compat: reject unused compat flag
netfilter: nft_compat: restrict match/target protocol to u16
net/af_iucv: clean up a try_then_request_module()
USB: serial: qcserial: add new usb-id for Dell Wireless DW5826e
USB: serial: option: add Fibocom FM101-GL variant
USB: serial: cp210x: add ID for IMST iM871A-USB
Input: atkbd - skip ATKBD_CMD_SETLEDS when skipping ATKBD_CMD_GETID
vhost: use kzalloc() instead of kmalloc() followed by memset()
hrtimer: Report offline hrtimer enqueue
btrfs: forbid creating subvol qgroups
btrfs: send: return EOPNOTSUPP on unknown flags
spi: ppc4xx: Drop write-only variable
ASoC: rt5645: Fix deadlock in rt5645_jack_detect_work()
Documentation: net-sysfs: describe missing statistics
net: sysfs: Fix /sys/class/net/<iface> path for statistics
MIPS: Add 'memory' clobber to csum_ipv6_magic() inline assembler
i40e: Fix waiting for queues of all VSIs to be disabled
tracing/trigger: Fix to return error if failed to alloc snapshot
mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again
HID: wacom: generic: Avoid reporting a serial of '0' to userspace
HID: wacom: Do not register input devices until after hid_hw_start
USB: hub: check for alternate port before enabling A_ALT_HNP_SUPPORT
usb: f_mass_storage: forbid async queue when shutdown happen
scsi: Revert "scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock"
firewire: core: correct documentation of fw_csr_string() kernel API
nfc: nci: free rx_data_reassembly skb on NCI device cleanup
xen-netback: properly sync TX responses
binder: signal epoll threads of self-work
ext4: fix double-free of blocks due to wrong extents moved_len
staging: iio: ad5933: fix type mismatch regression
ring-buffer: Clean ring_buffer_poll_wait() error return
serial: max310x: set default value when reading clock ready bit
serial: max310x: improve crystal stable clock detection
x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6
x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
ALSA: hda/conexant: Add quirk for SWS JS201D
nilfs2: fix data corruption in dsync block recovery for small block sizes
nilfs2: fix hang in nilfs_lookup_dirty_data_buffers()
nfp: use correct macro for LengthSelect in BAR config
irqchip/irq-brcmstb-l2: Add write memory barrier before exit
pmdomain: core: Move the unused cleanup to a _sync initcall
Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"
sched/membarrier: reduce the ability to hammer on sys_membarrier
nilfs2: fix potential bug in end_buffer_async_write
lsm: new security_file_ioctl_compat() hook
netfilter: nf_tables: fix pointer math issue in nft_byteorder_eval()
Linux 4.19.307
Change-Id: Ib05aec445afe9920e2502bcfce1c52db76e27139
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 5266caaf5660529e3da53004b8b7174cab6374ed ]
In blk_mq_mark_tag_wait(), __add_wait_queue() may be re-ordered
with the following blk_mq_get_driver_tag() in case of getting driver
tag failure.
Then in __sbitmap_queue_wake_up(), waitqueue_active() may not observe
the added waiter in blk_mq_mark_tag_wait() and wake up nothing, meantime
blk_mq_mark_tag_wait() can't get driver tag successfully.
This issue can be reproduced by running the following test in loop, and
fio hang can be observed in < 30min when running it on my test VM
in laptop.
modprobe -r scsi_debug
modprobe scsi_debug delay=0 dev_size_mb=4096 max_queue=1 host_max_queue=1 submit_queues=4
dev=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
fio --filename=/dev/"$dev" --direct=1 --rw=randrw --bs=4k --iodepth=1 \
--runtime=100 --numjobs=40 --time_based --name=test \
--ioengine=libaio
Fix the issue by adding one explicit barrier in blk_mq_mark_tag_wait(), which
is just fine in case of running out of tag.
Cc: Jan Kara <jack@suse.cz>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Reported-by: Changhui Zhong <czhong@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240112122626.4181044-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
refcount_inc_not_zero() in bt_tags_iter() still may read one freed
request.
Fix the issue by the following approach:
1) hold a per-tags spinlock when reading ->rqs[tag] and calling
refcount_inc_not_zero in bt_tags_iter()
2) clearing stale request referred via ->rqs[tag] before freeing
request pool, the per-tags spinlock is held for clearing stale
->rq[tag]
So after we cleared stale requests, bt_tags_iter() won't observe
freed request any more, also the clearing will wait for pending
request reference.
The idea of clearing ->rqs[] is borrowed from John Garry's previous
patch and one recent David's patch.
Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Bug: 197804811
Change-Id: If49478d7b05d3f5b0a26966ddf9ae764cf2fb6b0
(cherry picked from commit bd63141d585bef14f4caf111f6d0e27fe2300ec6)
[ refactored to avoid breaking KMI ]
Signed-off-by: Pradeep P V K <pragalla@codeaurora.org>
Signed-off-by: Todd Kjos <tkjos@google.com>
(cherry picked from commit bb96e7f45dc6ac1d6ec12190f1f286e3014fb068)
Signed-off-by: Lee Jones <joneslee@google.com>
[ Upstream commit 630ef623ed26c18a457cdc070cf24014e50129c2 ]
If a tag set is shared across request queues (e.g. SCSI LUNs) then the
block layer core keeps track of the number of active request queues in
tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that
atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make
sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is
cleared by blk_mq_del_queue_tag_set().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Fixes: 0d2602ca30 ("blk-mq: improve support for shared tags maps")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210513171529.7977-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit c6ba933358f0d7a6a042b894dba20cc70396a6d3 upstream.
blk_mq_map_swqueue() is called from blk_mq_init_allocated_queue()
and blk_mq_update_nr_hw_queues(). For the former caller, the kobject
isn't exposed to userspace yet. For the latter caller, hctx sysfs entries
and debugfs are un-registered before updating nr_hw_queues.
On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after
request queue is freed") moves freeing hctx into queue's release
handler, so there won't be race with queue release path too.
So don't hold q->sysfs_lock in blk_mq_map_swqueue().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c92a41031a6d57395889b5c87cea359220a24d2a ]
Factor out the requeue handling from the dispatch code, this will make
subsequent addition of different requeueing schemes easier.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit d7d8535f377e9ba87edbf7fbbd634ac942f3f54f upstream.
SCHED_RESTART code path is relied to re-run queue for dispatch requests
in hctx->dispatch. Meantime the SCHED_RSTART flag is checked when adding
requests to hctx->dispatch.
memory barriers have to be used for ordering the following two pair of OPs:
1) adding requests to hctx->dispatch and checking SCHED_RESTART in
blk_mq_dispatch_rq_list()
2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
in blk_mq_sched_restart().
Without the added memory barrier, either:
1) blk_mq_sched_restart() may miss requests added to hctx->dispatch meantime
blk_mq_dispatch_rq_list() observes SCHED_RESTART, and not run queue in
dispatch side
or
2) blk_mq_dispatch_rq_list still sees SCHED_RESTART, and not run queue
in dispatch side, meantime checking if there is request in
hctx->dispatch from blk_mq_sched_restart() is missed.
IO hang in ltp/fs_fill test is reported by kernel test robot:
https://lkml.org/lkml/2020/7/26/77
Turns out it is caused by the above out-of-order OPs. And the IO hang
can't be observed any more after applying this patch.
Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Jeffery <djeffery@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8d6996630c03d7ceeabe2611378fea5ca1c3f1b3 upstream.
We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:
[ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[ 108.827059] PGD 0 P4D 0
[ 108.827313] Oops: 0000 [#1] SMP PTI
[ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[ 108.829503] Workqueue: kblockd blk_mq_timeout_work
[ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[ 108.838191] Call Trace:
[ 108.838406] bt_iter+0x74/0x80
[ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
[ 108.839074] ? __switch_to_asm+0x34/0x70
[ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
[ 108.840732] blk_mq_timeout_work+0x74/0x200
[ 108.841151] process_one_work+0x297/0x680
[ 108.841550] worker_thread+0x29c/0x6f0
[ 108.841926] ? rescuer_thread+0x580/0x580
[ 108.842344] kthread+0x16a/0x1a0
[ 108.842666] ? kthread_flush_work+0x170/0x170
[ 108.843100] ret_from_fork+0x35/0x40
The bug is caused by the race between timeout handle and completion for
flush request.
When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.
After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.
However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.
We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-------
v2:
- move rq_status from struct request to struct blk_flush_queue
v3:
- remove unnecessary '{}' pair.
v4:
- let spinlock to protect 'fq->rq_status'
v5:
- move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[ Upstream commit e26cc08265dda37d2acc8394604f220ef412299d ]
blk_exit_queue will free elevator_data, while blk_mq_requeue_work
will access it. Move cancel of requeue_work to the front of
blk_exit_queue to avoid use-after-free.
blk_exit_queue blk_mq_requeue_work
__elevator_exit blk_mq_run_hw_queues
blk_mq_exit_sched blk_mq_run_hw_queue
dd_exit_queue blk_mq_hctx_has_pending
kfree(elevator_data) blk_mq_sched_has_work
dd_has_work
Fixes: fbc2a15e3433 ("blk-mq: move cancel of requeue_work into blk_mq_release")
Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5b202853ffbc54b29f23c4b1b5f3948efab489a2 ]
blk_mq_realloc_hw_ctxs could be invoked during update hw queues.
At the momemt, IO is blocked. Change the gfp flags from GFP_KERNEL
to GFP_NOIO to avoid forever hang during memory allocation in
blk_mq_realloc_hw_ctxs.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c7e2d94b3d1634988a95ac4d77a72dc7487ece06 ]
Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909
("blk-mq: Fix a use-after-free") fixes this issue exactly.
However, that commit introduces another issue. Before 45a9c9d909,
we are allowed to run queue during cleaning up queue if the queue's
kobj refcount is held. After that commit, queue can't be run during
queue cleaning up, otherwise oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().
We have invented ways for addressing this kind of issue before, such as:
8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2f3 ("blk-mq: quiesce queue before freeing queue")
But still can't cover all cases, recently James reports another such
kind of issue:
https://marc.info/?l=linux-scsi&m=155389088124782&w=2
This issue can be quite hard to address by previous way, given
scsi_run_queue() may run requeues for other LUNs.
Fixes the above issue by freeing hctx's resources in its release handler, and this
way is safe becasue tags isn't needed for freeing such hctx resource.
This approach follows typical design pattern wrt. kobject's release handler.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fbc2a15e3433058582e5635aabe48a3011a644a8 ]
With holding queue's kobject refcount, it is safe for driver
to schedule requeue. However, blk_mq_kick_requeue_list() may
be called after blk_sync_queue() is done because of concurrent
requeue activities, then requeue work may not be completed when
freeing queue, and kernel oops is triggered.
So moving the cancel of requeue_work into blk_mq_release() for
avoiding race between requeue and freeing queue.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 77f1e0a52d26242b6c2dba019f6ebebfb9ff701e upstream
A previous commit moved the shallow depth and BFQ depth map calculations
to be done at init time, moving it outside of the hotter IO path. This
potentially causes hangs if the users changes the depth of the scheduler
map, by writing to the 'nr_requests' sysfs file for that device.
Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if
the depth changes, so that the scheduler can update its internal state.
Signed-off-by: Eric Wheeler <bfq@linux.ewheeler.net>
Tested-by: Kai Krakow <kai@kaishome.de>
Reported-by: Paolo Valente <paolo.valente@linaro.org>
Fixes: f0635b8a41 ("bfq: calculate shallow depths at init time")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit aef1897cd36dcf5e296f1d2bae7e0d268561b685 ]
When requeue, if RQF_DONTPREP, rq has contained some driver
specific data, so insert it to hctx dispatch list to avoid any
merge. Take scsi as example, here is the trace event log (no
io scheduler, because RQF_STARTED would prevent merging),
kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]
(32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
the sdb only contained the part of (32768 + 8), then only that part
was completed. The lucky thing was that scsi_io_completion detected
it and requeued the remaining part. So we didn't get corrupted data.
However, the requeue of (32776 + 8) is not expected.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit c616cbee97aed4bc6178f148a7240206dcdb85a6 upstream.
After the direct dispatch corruption fix, we permanently disallow direct
dispatch of non read/write requests. This works fine off the normal IO
path, as they will be retried like any other failed direct dispatch
request. But for the blk_insert_cloned_request() that only DM uses to
bypass the bottom level scheduler, we always first attempt direct
dispatch. For some types of requests, that's now a permanent failure,
and no amount of retrying will make that succeed. This results in a
livelock.
Instead of making special cases for what we can direct issue, and now
having to deal with DM solving the livelock while still retaining a BUSY
condition feedback loop, always just add a request that has been through
->queue_rq() to the hardware queue dispatch list. These are safe to use
as no merging can take place there. Additionally, if requests do have
prepped data from drivers, we aren't dependent on them not sharing space
in the request structure to safely add them to the IO scheduler lists.
This basically reverts ffe81d45322c and is based on a patch from Ming,
but with the list insert case covered as well.
Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Cc: stable@vger.kernel.org
Suggested-by: Ming Lei <ming.lei@redhat.com>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ffe81d45322cc3cb140f0db080a4727ea284661e upstream.
If we attempt a direct issue to a SCSI device, and it returns BUSY, then
we queue the request up normally. However, the SCSI layer may have
already setup SG tables etc for this particular command. If we later
merge with this request, then the old tables are no longer valid. Once
we issue the IO, we only read/write the original part of the request,
not the new state of it.
This causes data corruption, and is most often noticed with the file
system complaining about the just read data being invalid:
[ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
because most of it is garbage...
This doesn't happen from the normal issue path, as we will simply defer
the request to the hardware queue dispatch list if we fail. Once it's on
the dispatch list, we never merge with it.
Fix this from the direct issue path by flagging the request as
REQ_NOMERGE so we don't change the size of it before issue.
See also:
https://bugzilla.kernel.org/show_bug.cgi?id=201685
Tested-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 6ce3dd6eec ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
trace_block_unplug() takes true for explicit unplugs and false for
implicit unplugs. schedule() unplugs are implicit and should be
reported as timer unplugs. While correct in the legacy code, this has
been inverted in blk-mq since 4.11.
Cc: stable@vger.kernel.org
Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull more block updates from Jens Axboe:
- Set of bcache fixes and changes (Coly)
- The flush warn fix (me)
- Small series of BFQ fixes (Paolo)
- wbt hang fix (Ming)
- blktrace fix (Steven)
- blk-mq hardware queue count update fix (Jianchao)
- Various little fixes
* tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
block/DAC960.c: make some arrays static const, shrinks object size
blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
blk-mq: init hctx sched after update ctx and hctx mapping
block: remove duplicate initialization
tracing/blktrace: Fix to allow setting same value
pktcdvd: fix setting of 'ret' error return for a few cases
block: change return type to bool
block, bfq: return nbytes and not zero from struct cftype .write() method
block, bfq: improve code of bfq_bfqq_charge_time
block, bfq: reduce write overcharge
block, bfq: always update the budget of an entity when needed
block, bfq: readd missing reset of parent-entity service
blk-wbt: fix IO hang in wbt_wait()
block: don't warn for flush on read-only device
bcache: add the missing comments for smp_mb()/smp_wmb()
bcache: remove unnecessary space before ioctl function pointer arguments
bcache: add missing SPDX header
bcache: move open brace at end of function definitions to next line
bcache: add static const prefix to char * array declarations
bcache: fix code comments style
...
For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
account the inflight requests. It will access the queue_hw_ctx and
nr_hw_queues w/o any protection. When updating nr_hw_queues and
blk_mq_in_flight/rw occur concurrently, panic comes up.
Before update nr_hw_queues, the q will be frozen. So we could use
q_usage_counter to avoid the race. percpu_ref_is_zero is used here
so that we will not miss any in-flight request. The access to
nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
under rcu critical section, __blk_mq_update_nr_hw_queues could use
synchronize_rcu to ensure the zeroed q_usage_counter to be globally
visible.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, when update nr_hw_queues, IO scheduler's init_hctx will
be invoked before the mapping between ctx and hctx is adapted
correctly by blk_mq_map_swqueue. The IO scheduler init_hctx (kyber)
may depend on this mapping and get wrong result and panic finally.
A simply way to fix this is that switch the IO scheduler to 'none'
before update the nr_hw_queues, and then switch it back after
update nr_hw_queues. blk_mq_sched_init_/exit_hctx are removed due
to nobody use them any more.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block updates from Jens Axboe:
"First pull request for this merge window, there will also be a
followup request with some stragglers.
This pull request contains:
- Fix for a thundering heard issue in the wbt block code (Anchal
Agarwal)
- A few NVMe pull requests:
* Improved tracepoints (Keith)
* Larger inline data support for RDMA (Steve Wise)
* RDMA setup/teardown fixes (Sagi)
* Effects log suppor for NVMe target (Chaitanya Kulkarni)
* Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
* TP4004 (ANA) support (Christoph)
* Various NVMe fixes
- Block io-latency controller support. Much needed support for
properly containing block devices. (Josef)
- Series improving how we handle sense information on the stack
(Kees)
- Lightnvm fixes and updates/improvements (Mathias/Javier et al)
- Zoned device support for null_blk (Matias)
- AIX partition fixes (Mauricio Faria de Oliveira)
- DIF checksum code made generic (Max Gurtovoy)
- Add support for discard in iostats (Michael Callahan / Tejun)
- Set of updates for BFQ (Paolo)
- Removal of async write support for bsg (Christoph)
- Bio page dirtying and clone fixups (Christoph)
- Set of bcache fix/changes (via Coly)
- Series improving blk-mq queue setup/teardown speed (Ming)
- Series improving merging performance on blk-mq (Ming)
- Lots of other fixes and cleanups from a slew of folks"
* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
blkcg: Make blkg_root_lookup() work for queues in bypass mode
bcache: fix error setting writeback_rate through sysfs interface
null_blk: add lock drop/acquire annotation
Blk-throttle: reduce tail io latency when iops limit is enforced
block: paride: pd: mark expected switch fall-throughs
block: Ensure that a request queue is dissociated from the cgroup controller
block: Introduce blk_exit_queue()
blkcg: Introduce blkg_root_lookup()
block: Remove two superfluous #include directives
blk-mq: count the hctx as active before allocating tag
block: bvec_nr_vecs() returns value for wrong slab
bcache: trivial - remove tailing backslash in macro BTREE_FLAG
bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
bcache: set max writeback rate when I/O request is idle
bcache: add code comments for bset.c
bcache: fix mistaken comments in request.c
bcache: fix mistaken code comments in bcache.h
bcache: add a comment in super.c
bcache: avoid unncessary cache prefetch bch_btree_node_get()
bcache: display rate debug parameters to 0 when writeback is not running
...
Currently, we count the hctx as active after allocate driver tag
successfully. If a previously inactive hctx try to get tag first
time, it may fails and need to wait. However, due to the stale tag
->active_queues, the other shared-tags users are still able to
occupy all driver tags while there is someone waiting for tag.
Consequently, even if the previously inactive hctx is waked up, it
still may not be able to get a tag and could be starved.
To fix it, we count the hctx as active before try to allocate driver
tag, then when it is waiting the tag, the other shared-tag users
will reserve budget for it.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"Bigger than usual at this time, mostly due to the O_DIRECT corruption
issue and the fact that I was on vacation last week. This contains:
- NVMe pull request with two fixes for the FC code, and two target
fixes (Christoph)
- a DIF bio reset iteration fix (Greg Edwards)
- two nbd reply and requeue fixes (Josef)
- SCSI timeout fixup (Keith)
- a small series that fixes an issue with bio_iov_iter_get_pages(),
which ended up causing corruption for larger sized O_DIRECT writes
that ended up racing with buffered writes (Martin Wilck)"
* tag 'for-linus-20180727' of git://git.kernel.dk/linux-block:
block: reset bi_iter.bi_done after splitting bio
block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
blkdev: __blkdev_direct_IO_simple: fix leak in error case
block: bio_iov_iter_get_pages: fix size of last iovec
nvmet: only check for filebacking on -ENOTBLK
nvmet: fixup crash on NULL device path
scsi: set timed out out mq requests to complete
blk-mq: export setting request completion state
nvme: if_ready checks to fail io to deleting controller
nvmet-fc: fix target sgl list on large transfers
nbd: handle unexpected replies better
nbd: don't requeue the same request twice.
This is preparing for drivers that want to directly alter the state of
their requests. No functional change here.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of 'none' io scheduler, when hw queue isn't busy, it isn't
necessary to enqueue request to sw queue and dequeue it from
sw queue because request may be submitted to hw queue asap without
extra cost, meantime there shouldn't be much request in sw queue,
and we don't need to worry about effect on IO merge.
There are still some single hw queue SCSI HBAs(HPSA, megaraid_sas, ...)
which may connect high performance devices, so 'none' is often required
for obtaining good performance.
This patch improves IOPS and decreases CPU unilization on megaraid_sas,
per Kashyap's test.
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't really need to save this stuff in the core block code, we can
just pass the bio back into the helpers later on to derive the same
flags and update the rq->wbt_flags appropriately.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis. Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It won't be efficient to dequeue request one by one from sw queue,
but we have to do that when queue is busy for better merge performance.
This patch takes the Exponential Weighted Moving Average(EWMA) to figure
out if queue is busy, then only dequeue request one by one from sw queue
when queue is busy.
Fixes: b347689ffb ("blk-mq-sched: improve dispatching from sw queue")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix typo in a function blk_mq_alloc_tag_set() comment.
if if it too large -> if it's too large.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
set->mq_map is now currently cleared if something goes wrong when
establishing a queue map in blk-mq-pci.c. It's also cleared before
updating a queue map in blk_mq_update_queue_map().
This patch provides an API to clear set->mq_map to make it clear.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have to remove synchronize_rcu() from blk_queue_cleanup(),
otherwise long delay can be caused during lun probe. For removing
it, we have to avoid to iterate the set->tag_list in IO path, eg,
blk_mq_sched_restart().
This patch reverts 5b79413946d (Revert "blk-mq: don't handle
TAG_SHARED in restart"). Given we have fixed enough IO hang issue,
and there isn't any reason to restart all queues in one tags any more,
see the following reasons:
1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
which can wake up queues waiting for driver tag.
2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue,
target or host is ready, but SCSI built-in restart can cover all these well,
see scsi_end_request(), queue will be rerun after any request initiated from
this host/target is completed.
In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
when running I/O on these 8 luns concurrently.
Fixes: 705cda97ee ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Reported-by: Andrew Jones <drjones@redhat.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now hctx->lock is only acquired when adding hctx->dispatch_wait to
one wait queue, but not held when removing it from the wait queue.
IO hang can be observed easily if SCHED RESTART is disabled, that means
now RESTART exits just for fixing the issue in blk_mq_mark_tag_wait().
This patch fixes the issue by introducing hctx->dispatch_wait_lock and
holding it for removing hctx->dispatch_wait in blk_mq_dispatch_wake(),
since we need to avoid acquiring hctx->lock in irq context.
Fixes: eb619fdb2d ("blk-mq: fix issue with shared tag queue re-running")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence
we never change '**hctx' as well. The last use of these went away
with the flush cleanup, commit 0c2a6fe4dc.
So cleanup the usage and remove the two extra parameters.
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"Small set of fixes for this series. Mostly just minor fixes, the only
oddball in here is the sg change.
The sg change came out of the stall fix for NVMe, where we added a
mempool and limited us to a single page allocation. CONFIG_SG_DEBUG
sort-of ruins that, since we'd need to account for that. That's
actually a generic problem, since lots of drivers need to allocate SG
lists. So this just removes support for CONFIG_SG_DEBUG, which I added
back in 2007 and to my knowledge it was never useful.
Anyway, outside of that, this pull contains:
- clone of request with special payload fix (Bart)
- drbd discard handling fix (Bart)
- SATA blk-mq stall fix (me)
- chunk size fix (Keith)
- double free nvme rdma fix (Sagi)"
* tag 'for-linus-20180629' of git://git.kernel.dk/linux-block:
sg: remove ->sg_magic member
drbd: Fix drbd_request_prepare() discard handling
blk-mq: don't queue more if we get a busy return
block: Fix cloning of requests with a special payload
nvme-rdma: fix possible double free of controller async event buffer
block: Fix transfer when chunk sectors exceeds max
Some devices have different queue limits depending on the type of IO. A
classic case is SATA NCQ, where some commands can queue, but others
cannot. If we have NCQ commands inflight and encounter a non-queueable
command, the driver returns busy. Currently we attempt to dispatch more
from the scheduler, if we were able to queue some commands. But for the
case where we ended up stopping due to BUSY, we should not attempt to
retrieve more from the scheduler. If we do, we can get into a situation
where we attempt to queue a non-queueable command, get BUSY, then
successfully retrieve more commands from that scheduler and queue those.
This can repeat forever, starving the non-queuable command indefinitely.
Fix this by NOT attempting to pull more commands from the scheduler, if
we get a BUSY return. This should also be more optimal in terms of
letting requests stay in the scheduler for as long as possible, if we
get a BUSY due to the regular out-of-tags condition.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
- Further timeout fixes. We aren't quite there yet, so expect another
round of fixes for that to completely close some of the IRQ vs
completion races. (Christoph/Bart)
- Set of NVMe fixes from the usual suspects, mostly error handling
- Two off-by-one fixes (Dan)
- Another bdi race fix (Jan)
- Fix nbd reconfigure with NBD_DISCONNECT_ON_CLOSE (Doron)
* tag 'for-linus-20180623' of git://git.kernel.dk/linux-block:
blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE
bdi: Fix another oops in wb_workfn()
lightnvm: Remove depends on HAS_DMA in case of platform dependency
nvme-pci: limit max IO size and segments to avoid high order allocations
nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrl
nvme-fc: release io queues to allow fast fail
nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.
block: sed-opal: Fix a couple off by one bugs
blk-mq-debugfs: Off by one in blk_mq_rq_state_name()
nvmet: reset keep alive timer in controller enable
nvme-rdma: don't override opts->queue_size
nvme-rdma: Fix command completion race at error recovery
nvme-rdma: fix possible free of a non-allocated async event buffer
nvme-rdma: fix possible double free condition when failing to create a controller
Revert "block: Add warning for bi_next not NULL in bio_endio()"
block: fix timeout changes for legacy request drivers
Make sure that RQF_TIMED_OUT is cleared when a request is reused
after a block driver timeout handler has returned BLK_EH_DONE.
Fixes: da66126739 ("blk-mq: don't time out requests again that are in the timeout handler")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Andrew Randrianasulu <randrianasulu@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"A collection of fixes that should go into -rc1. This contains:
- bsg_open vs bsg_unregister race fix (Anatoliy)
- NVMe pull request from Christoph, with fixes for regressions in
this window, FC connect/reconnect path code unification, and a
trace point addition.
- timeout fix (Christoph)
- remove a few unused functions (Christoph)
- blk-mq tag_set reinit fix (Roman)"
* tag 'for-linus-20180616' of git://git.kernel.dk/linux-block:
bsg: fix race of bsg_open and bsg_unregister
block: remov blk_queue_invalidate_tags
nvme-fabrics: fix and refine state checks in __nvmf_check_ready
nvme-fabrics: handle the admin-only case properly in nvmf_check_ready
nvme-fabrics: refactor queue ready check
blk-mq: remove blk_mq_tagset_iter
nvme: remove nvme_reinit_tagset
nvme-fc: fix nulling of queue data on reconnect
nvme-fc: remove reinit_request routine
blk-mq: don't time out requests again that are in the timeout handler
nvme-fc: change controllers first connect to use reconnect path
nvme: don't rely on the changed namespace list log
nvmet: free smart-log buffer after use
nvme-rdma: fix error flow during mapping request data
nvme: add bio remapping tracepoint
nvme: fix NULL pointer dereference in nvme_init_subsystem
blk-mq: reinit q->tag_set_list entry only after grace period
We can currently call the timeout handler again on a request that has
already been handed over to the timeout handler. Prevent that with a new
flag.
Fixes: 12f5b931 ("blk-mq: Remove generation seqeunce")
Reported-by: Andrew Randrianasulu <randrianasulu@gmail.com>
Tested-by: Andrew Randrianasulu <randrianasulu@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is not allowed to reinit q->tag_set_list list entry while RCU grace
period has not completed yet, otherwise the following soft lockup in
blk_mq_sched_restart() happens:
[ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
[ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
[ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
[ 1064.256510] Call Trace:
[ 1064.256664] <IRQ>
[ 1064.256824] blk_mq_free_request+0xea/0x100
[ 1064.256987] msg_io_conf+0x59/0xd0 [ibnbd_client]
[ 1064.257175] complete_rdma_req+0xf2/0x230 [ibtrs_client]
[ 1064.257340] ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
[ 1064.257502] ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
[ 1064.257669] ib_create_qp+0x321/0x380 [ib_core]
[ 1064.257841] ib_process_cq_direct+0xbd/0x120 [ib_core]
[ 1064.258007] irq_poll_softirq+0xb7/0xe0
[ 1064.258165] __do_softirq+0x106/0x2a2
[ 1064.258328] irq_exit+0x92/0xa0
[ 1064.258509] do_IRQ+0x4a/0xd0
[ 1064.258660] common_interrupt+0x7a/0x7a
[ 1064.258818] </IRQ>
Meanwhile another context frees other queue but with the same set of
shared tags:
[ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
[ 1288.201833] bash D 0 5910 5820 0x00000000
[ 1288.202016] Call Trace:
[ 1288.202315] schedule+0x32/0x80
[ 1288.202462] schedule_timeout+0x1e5/0x380
[ 1288.203838] wait_for_completion+0xb0/0x120
[ 1288.204137] __wait_rcu_gp+0x125/0x160
[ 1288.204287] synchronize_sched+0x6e/0x80
[ 1288.204770] blk_mq_free_queue+0x74/0xe0
[ 1288.204922] blk_cleanup_queue+0xc7/0x110
[ 1288.205073] ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
[ 1288.205389] ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
[ 1288.205548] kernfs_fop_write+0x109/0x180
[ 1288.206328] vfs_write+0xb3/0x1a0
[ 1288.206476] SyS_write+0x52/0xc0
[ 1288.206624] do_syscall_64+0x68/0x1d0
[ 1288.206774] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
What happened is the following:
1. There are several MQ queues with shared tags.
2. One queue is about to be freed and now task is in
blk_mq_del_queue_tag_set().
3. Other CPU is in blk_mq_sched_restart() and loops over all queues in
tag list in order to find hctx to restart.
Because linked list entry was modified in blk_mq_del_queue_tag_set()
without proper waiting for a grace period, blk_mq_sched_restart()
never ends, spining in list_for_each_entry_rcu_rr(), thus soft lockup.
Fix is simple: reinit list entry after an RCU grace period elapsed.
Fixes: Fixes: 705cda97ee ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: stable@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a hardware queue is stopped, it should not be run again before
explicitly started. Ignore stopped queues in blk_mq_run_work_fn(),
fixing a regression recently introduced when the START_ON_RUN bit
was removed.
Fixes: 15fe8a90bb ("blk-mq: remove blk_mq_delay_queue()")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both callers take just around so function call, so move it in.
Also remove the now pointless blk_mq_sched_init wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>