kernel_xiaomi_sm8250

Author	SHA1	Message	Date
Greg Kroah-Hartman	874391c94e	Merge 4.19.325 into android-4.19-stable Changes in 4.19.325 netlink: terminate outstanding dump on socket close ocfs2: uncache inode which has failed entering the group nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint ocfs2: fix UBSAN warning in ocfs2_verify_volume() nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint Revert "mmc: dw_mmc: Fix IDMAC operation with pages bigger than 4K" media: dvbdev: fix the logic when DVB_DYNAMIC_MINORS is not set kbuild: Use uname for LINUX_COMPILE_HOST detection mm: revert "mm: shmem: fix data-race in shmem_getattr()" ASoC: Intel: bytcr_rt5640: Add DMI quirk for Vexia Edu Atla 10 tablet mac80211: fix user-power when emulating chanctx selftests/watchdog-test: Fix system accidentally reset after watchdog-test x86/amd_nb: Fix compile-testing without CONFIG_AMD_NB net: usb: qmi_wwan: add Quectel RG650V proc/softirqs: replace seq_printf with seq_put_decimal_ull_width nvme: fix metadata handling in nvme-passthrough initramfs: avoid filename buffer overrun m68k: mvme147: Fix SCSI controller IRQ numbers m68k: mvme16x: Add and use "mvme16x.h" m68k: mvme147: Reinstate early console acpi/arm64: Adjust error handling procedure in gtdt_parse_timer_block() s390/syscalls: Avoid creation of arch/arch/ directory hfsplus: don't query the device logical block size multiple times EDAC/fsl_ddr: Fix bad bit shift operations crypto: pcrypt - Call crypto layer directly when padata_do_parallel() return -EBUSY crypto: cavium - Fix the if condition to exit loop after timeout crypto: bcm - add error check in the ahash_hmac_init function crypto: cavium - Fix an error handling path in cpt_ucode_load_fw() time: Fix references to _msecs_to_jiffies() handling of values soc: qcom: geni-se: fix array underflow in geni_se_clk_tbl_get() mmc: mmc_spi: drop buggy snprintf() ARM: dts: cubieboard4: Fix DCDC5 regulator constraints regmap: irq: Set lockdep class for hierarchical IRQ domains firmware: arm_scpi: Check the DVFS OPP count returned by the firmware drm/mm: Mark drm_mm_interval_tree() functions with __maybe_unused wifi: ath9k: add range check for conn_rsp_epid in htc_connect_service() drm/omap: Fix locking in omap_gem_new_dmabuf() bpf: Fix the xdp_adjust_tail sample prog issue wifi: mwifiex: Fix memcpy() field-spanning write warning in mwifiex_config_scan() drm/etnaviv: consolidate hardware fence handling in etnaviv_gpu drm/etnaviv: dump: fix sparse warnings drm/etnaviv: fix power register offset on GC300 drm/etnaviv: hold GPU lock across perfmon sampling net: rfkill: gpio: Add check for clk_enable() ALSA: us122l: Use snd_card_free_when_closed() at disconnection ALSA: caiaq: Use snd_card_free_when_closed() at disconnection ALSA: 6fire: Release resources at card release netpoll: Use rcu_access_pointer() in netpoll_poll_lock trace/trace_event_perf: remove duplicate samples on the first tracepoint event powerpc/vdso: Flag VDSO64 entry points as functions mfd: da9052-spi: Change read-mask to write-mask cpufreq: loongson2: Unregister platform_driver on failure mtd: rawnand: atmel: Fix possible memory leak RDMA/bnxt_re: Check cqe flags to know imm_data vs inv_irkey mfd: rt5033: Fix missing regmap_del_irq_chip() scsi: bfa: Fix use-after-free in bfad_im_module_exit() scsi: fusion: Remove unused variable 'rc' scsi: qedi: Fix a possible memory leak in qedi_alloc_and_init_sb() ocfs2: fix uninitialized value in ocfs2_file_read_iter() powerpc/sstep: make emulate_vsx_load and emulate_vsx_store static fbdev/sh7760fb: Alloc DMA memory from hardware device fbdev: sh7760fb: Fix a possible memory leak in sh7760fb_alloc_mem() dt-bindings: clock: adi,axi-clkgen: convert old binding to yaml format dt-bindings: clock: axi-clkgen: include AXI clk clk: axi-clkgen: use devm_platform_ioremap_resource() short-hand clk: clk-axi-clkgen: make sure to enable the AXI bus clock perf probe: Correct demangled symbols in C++ program PCI: cpqphp: Use PCI_POSSIBLE_ERROR() to check config reads PCI: cpqphp: Fix PCIBIOS_ return value confusion m68k: mcfgpio: Fix incorrect register offset for CONFIG_M5441x m68k: coldfire/device.c: only build FEC when HW macros are defined rpmsg: glink: Add TX_DATA_CONT command while sending rpmsg: glink: Send READ_NOTIFY command in FIFO full case rpmsg: glink: Fix GLINK command prefix rpmsg: glink: use only lower 16-bits of param2 for CMD_OPEN name length NFSD: Prevent NULL dereference in nfsd4_process_cb_update() NFSD: Cap the number of bytes copied by nfs4_reset_recoverydir() vfio/pci: Properly hide first-in-list PCIe extended capability power: supply: core: Remove might_sleep() from power_supply_put() net: usb: lan78xx: Fix memory leak on device unplug by freeing PHY device tg3: Set coherent DMA mask bits to 31 for BCM57766 chipsets net: usb: lan78xx: Fix refcounting and autosuspend on invalid WoL configuration marvell: pxa168_eth: fix call balance of pep->clk handling routines net: stmmac: dwmac-socfpga: Set RX watchdog interrupt as broken usb: using mutex lock and supporting O_NONBLOCK flag in iowarrior_read() USB: chaoskey: fail open after removal USB: chaoskey: Fix possible deadlock chaoskey_list_lock misc: apds990x: Fix missing pm_runtime_disable() apparmor: fix 'Do simple duplicate message elimination' usb: ehci-spear: fix call balance of sehci clk handling routines ext4: supress data-race warnings in ext4_free_inodes_{count,set}() ext4: fix FS_IOC_GETFSMAP handling jfs: xattr: check invalid xattr size more strictly ASoC: codecs: Fix atomicity violation in snd_soc_component_get_drvdata() PCI: Fix use-after-free of slot->bus on hot remove tty: ldsic: fix tty_ldisc_autoload sysctl's proc_handler Bluetooth: Fix type of len in rfcomm_sock_getsockopt{,_old}() ALSA: usb-audio: Fix potential out-of-bound accesses for Extigy and Mbox devices Revert "usb: gadget: composite: fix OS descriptors w_value logic" serial: sh-sci: Clean sci_ports[0] after at earlycon exit Revert "serial: sh-sci: Clean sci_ports[0] after at earlycon exit" netfilter: ipset: add missing range check in bitmap_ip_uadt spi: Fix acpi deferred irq probe ubi: wl: Put source PEB into correct list if trying locking LEB failed um: ubd: Do not use drvdata in release um: net: Do not use drvdata in release serial: 8250: omap: Move pm_runtime_get_sync um: vector: Do not use drvdata in release sh: cpuinfo: Fix a warning for CONFIG_CPUMASK_OFFSTACK arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled block: fix ordering between checking BLK_MQ_S_STOPPED request adding HID: wacom: Interpret tilt data from Intuos Pro BT as signed values media: wl128x: Fix atomicity violation in fmc_send_cmd() usb: dwc3: gadget: Fix checking for number of TRBs left lib: string_helpers: silence snprintf() output truncation warning NFSD: Prevent a potential integer overflow rpmsg: glink: Propagate TX failures in intentless mode as well um: Fix the return value of elf_core_copy_task_fpregs NFSv4.0: Fix a use-after-free problem in the asynchronous open() rtc: check if __rtc_read_time was successful in rtc_timer_do_work() ubifs: Correct the total block count by deducting journal reservation ubi: fastmap: Fix duplicate slab cache names while attaching jffs2: fix use of uninitialized variable block: return unsigned int from bdev_io_min 9p/xen: fix init sequence 9p/xen: fix release of IRQ modpost: remove incorrect code in do_eisa_entry() sh: intc: Fix use-after-free bug in register_intc_controller() Linux 4.19.325 Change-Id: I50250c8bd11f9ff4b40da75225c1cfb060e0c258 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2024-12-05 11:21:28 +00:00
Muchun Song	3e62d49f59	block: fix ordering between checking BLK_MQ_S_STOPPED request adding commit 96a9fe64bfd486ebeeacf1e6011801ffe89dae18 upstream. Supposing first scenario with a virtio_blk driver. CPU0 CPU1 blk_mq_try_issue_directly() __blk_mq_issue_directly() q->mq_ops->queue_rq() virtio_queue_rq() blk_mq_stop_hw_queue() virtblk_done() blk_mq_request_bypass_insert() 1) store blk_mq_start_stopped_hw_queue() clear_bit(BLK_MQ_S_STOPPED) 3) store blk_mq_run_hw_queue() if (!blk_mq_hctx_has_pending()) 4) load return blk_mq_sched_dispatch_requests() blk_mq_run_hw_queue() if (!blk_mq_hctx_has_pending()) return blk_mq_sched_dispatch_requests() if (blk_mq_hctx_stopped()) 2) load return __blk_mq_sched_dispatch_requests() Supposing another scenario. CPU0 CPU1 blk_mq_requeue_work() blk_mq_insert_request() 1) store virtblk_done() blk_mq_start_stopped_hw_queue() blk_mq_run_hw_queues() clear_bit(BLK_MQ_S_STOPPED) 3) store blk_mq_run_hw_queue() if (!blk_mq_hctx_has_pending()) 4) load return blk_mq_sched_dispatch_requests() if (blk_mq_hctx_stopped()) 2) load continue blk_mq_run_hw_queue() Both scenarios are similar, the full memory barrier should be inserted between 1) and 2), as well as between 3) and 4) to make sure that either CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1 sees dispatch list. Otherwise, either CPU will not rerun the hardware queue causing starvation of the request. The easy way to fix it is to add the essential full memory barrier into helper of blk_mq_hctx_stopped(). In order to not affect the fast path (hardware queue is not stopped most of the time), we only insert the barrier into the slow path. Actually, only slow path needs to care about missing of dispatching the request to the low-level device driver. Fixes: `320ae51fee` ("blk-mq: new multi-queue block IO queueing mechanism") Cc: stable@vger.kernel.org Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241014092934.53630-4-songmuchun@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-05 10:59:40 +01:00
Greg Kroah-Hartman	45df1db3d3	Merge 4.19.307 into android-4.19-stable Changes in 4.19.307 PCI: mediatek: Clear interrupt status before dispatching handler include/linux/units.h: add helpers for kelvin to/from Celsius conversion units: Add Watt units units: change from 'L' to 'UL' units: add the HZ macros serial: sc16is7xx: set safe default SPI clock frequency driver core: add device probe log helper spi: introduce SPI_MODE_X_MASK macro serial: sc16is7xx: add check for unsupported SPI modes during probe ext4: allow for the last group to be marked as trimmed crypto: api - Disallow identical driver names PM: hibernate: Enforce ordering during image compression/decompression hwrng: core - Fix page fault dead lock on mmap-ed hwrng rpmsg: virtio: Free driver_override when rpmsg_remove() parisc/firmware: Fix F-extend for PDC addresses nouveau/vmm: don't set addr on the fail path to avoid warning block: Remove special-casing of compound pages powerpc: Use always instead of always-y in for crtsavres.o x86/CPU/AMD: Fix disabling XSAVES on AMD family 0x17 due to erratum driver core: Annotate dev_err_probe() with __must_check Revert "driver core: Annotate dev_err_probe() with __must_check" driver code: print symbolic error code drivers: core: fix kernel-doc markup for dev_err_probe() net/smc: fix illegal rmb_desc access in SMC-D connection dump vlan: skip nested type that is not IFLA_VLAN_QOS_MAPPING llc: make llc_ui_sendmsg() more robust against bonding changes llc: Drop support for ETH_P_TR_802_2. net/rds: Fix UBSAN: array-index-out-of-bounds in rds_cmsg_recv tracing: Ensure visibility when inserting an element into tracing_map tcp: Add memory barrier to tcp_push() netlink: fix potential sleeping issue in mqueue_flush_file net/mlx5: Use kfree(ft->g) in arfs_create_groups() net/mlx5e: fix a double-free in arfs_create_groups netfilter: nf_tables: restrict anonymous set and map names to 16 bytes fjes: fix memleaks in fjes_hw_setup net: fec: fix the unhandled context fault from smmu btrfs: don't warn if discard range is not aligned to sector btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args netfilter: nf_tables: reject QUEUE/DROP verdict parameters gpiolib: acpi: Ignore touchpad wakeup on GPD G1619-04 drm: Don't unref the same fb many times by mistake due to deadlock handling drm/bridge: nxp-ptn3460: fix i2c_master_send() error checking drm/bridge: nxp-ptn3460: simplify some error checking drm/exynos: gsc: minor fix for loop iteration in gsc_runtime_resume gpio: eic-sprd: Clear interrupt after set the interrupt type mips: Call lose_fpu(0) before initializing fcr31 in mips_set_personality_nan tick/sched: Preserve number of idle sleeps across CPU hotplug events x86/entry/ia32: Ensure s32 is sign extended to s64 net/sched: cbs: Fix not adding cbs instance to list powerpc/mm: Fix null-pointer dereference in pgtable_cache_add powerpc: Fix build error due to is_valid_bugaddr() powerpc/mm: Fix build failures due to arch_reserved_kernel_pages() powerpc/lib: Validate size for vector operations audit: Send netlink ACK before setting connection in auditd_set ACPI: video: Add quirk for the Colorful X15 AT 23 Laptop PNP: ACPI: fix fortify warning ACPI: extlog: fix NULL pointer dereference check FS:JFS:UBSAN:array-index-out-of-bounds in dbAdjTree UBSAN: array-index-out-of-bounds in dtSplitRoot jfs: fix slab-out-of-bounds Read in dtSearch jfs: fix array-index-out-of-bounds in dbAdjTree jfs: fix uaf in jfs_evict_inode pstore/ram: Fix crash when setting number of cpus to an odd number crypto: stm32/crc32 - fix parsing list of devices afs: fix the usage of read_seqbegin_or_lock() in afs_find_server*() rxrpc_find_service_conn_rcu: fix the usage of read_seqbegin_or_lock() jfs: fix array-index-out-of-bounds in diNewExt s390/ptrace: handle setting of fpc register correctly KVM: s390: fix setting of fpc register SUNRPC: Fix a suspicious RCU usage warning ext4: fix inconsistent between segment fstrim and full fstrim ext4: unify the type of flexbg_size to unsigned int ext4: remove unnecessary check from alloc_flex_gd() ext4: avoid online resizing failures due to oversized flex bg scsi: lpfc: Fix possible file string name overflow when updating firmware PCI: Add no PM reset quirk for NVIDIA Spectrum devices bonding: return -ENOMEM instead of BUG in alb_upper_dev_walk ARM: dts: imx7s: Fix lcdif compatible ARM: dts: imx7s: Fix nand-controller #size-cells wifi: ath9k: Fix potential array-index-out-of-bounds read in ath9k_htc_txstatus() bpf: Add map and need_defer parameters to .map_fd_put_ptr() scsi: libfc: Don't schedule abort twice scsi: libfc: Fix up timeout error in fc_fcp_rec_error() ARM: dts: rockchip: fix rk3036 hdmi ports node ARM: dts: imx25/27-eukrea: Fix RTC node name ARM: dts: imx: Use flash@0,0 pattern ARM: dts: imx27: Fix sram node ARM: dts: imx1: Fix sram node ARM: dts: imx27-apf27dev: Fix LED name ARM: dts: imx23-sansa: Use preferred i2c-gpios properties ARM: dts: imx23/28: Fix the DMA controller node name md: Whenassemble the array, consult the superblock of the freshest device wifi: rtl8xxxu: Add additional USB IDs for RTL8192EU devices wifi: rtlwifi: rtl8723{be,ae}: using calculate_bit_shift() wifi: cfg80211: free beacon_ies when overridden from hidden BSS f2fs: fix to check return value of f2fs_reserve_new_block() ASoC: doc: Fix undefined SND_SOC_DAPM_NOPM argument fast_dput(): handle underflows gracefully RDMA/IPoIB: Fix error code return in ipoib_mcast_join drm/drm_file: fix use of uninitialized variable drm/framebuffer: Fix use of uninitialized variable drm/mipi-dsi: Fix detach call without attach media: stk1160: Fixed high volume of stk1160_dbg messages media: rockchip: rga: fix swizzling for RGB formats PCI: add INTEL_HDA_ARL to pci_ids.h ALSA: hda: Intel: add HDA_ARL PCI ID support drm/exynos: Call drm_atomic_helper_shutdown() at shutdown/unbind time IB/ipoib: Fix mcast list locking media: ddbridge: fix an error code problem in ddb_probe drm/msm/dpu: Ratelimit framedone timeout msgs clk: hi3620: Fix memory leak in hi3620_mmc_clk_init() clk: mmp: pxa168: Fix memory leak in pxa168_clk_init() drm/amdgpu: Let KFD sync with VM fences drm/amdgpu: Drop 'fence' check in 'to_amdgpu_amdkfd_fence()' leds: trigger: panic: Don't register panic notifier if creating the trigger failed um: Fix naming clash between UML and scheduler um: Don't use vfprintf() for os_info() um: net: Fix return type of uml_net_start_xmit() mfd: ti_am335x_tscadc: Fix TI SoC dependencies PCI: Only override AMD USB controller if required usb: hub: Replace hardcoded quirk value with BIT() macro libsubcmd: Fix memory leak in uniq() virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings blk-mq: fix IO hang from sbitmap wakeup race ceph: fix deadlock or deadcode of misusing dget() drm/amdgpu: Release 'adev->pm.fw' before return in 'amdgpu_device_need_post()' wifi: cfg80211: fix RCU dereference in __cfg80211_bss_update scsi: isci: Fix an error code problem in isci_io_request_build() net: remove unneeded break ixgbe: Remove non-inclusive language ixgbe: Refactor returning internal error codes ixgbe: Refactor overtemp event handling ixgbe: Fix an error handling path in ixgbe_read_iosf_sb_reg_x550() ipv6: Ensure natural alignment of const ipv6 loopback and router addresses llc: call sock_orphan() at release time netfilter: nf_log: replace BUG_ON by WARN_ON_ONCE when putting logger net: ipv4: fix a memleak in ip_setup_cork af_unix: fix lockdep positive in sk_diag_dump_icons() net: sysfs: Fix /sys/class/net/<iface> path HID: apple: Add support for the 2021 Magic Keyboard HID: apple: Swap the Fn and Left Control keys on Apple keyboards HID: apple: Add 2021 magic keyboard FN key mapping bonding: remove print in bond_verify_device_path dmaengine: fix is_slave_direction() return false when DMA_DEV_TO_DEV phy: ti: phy-omap-usb2: Fix NULL pointer dereference for SRP atm: idt77252: fix a memleak in open_card_ubr0 hwmon: (aspeed-pwm-tacho) mutex for tach reading hwmon: (coretemp) Fix out-of-bounds memory access hwmon: (coretemp) Fix bogus core_id to attr name mapping inet: read sk->sk_family once in inet_recv_error() rxrpc: Fix response to PING RESPONSE ACKs to a dead call tipc: Check the bearer type before calling tipc_udp_nl_bearer_add() ppp_async: limit MRU to 64K netfilter: nft_compat: reject unused compat flag netfilter: nft_compat: restrict match/target protocol to u16 net/af_iucv: clean up a try_then_request_module() USB: serial: qcserial: add new usb-id for Dell Wireless DW5826e USB: serial: option: add Fibocom FM101-GL variant USB: serial: cp210x: add ID for IMST iM871A-USB Input: atkbd - skip ATKBD_CMD_SETLEDS when skipping ATKBD_CMD_GETID vhost: use kzalloc() instead of kmalloc() followed by memset() hrtimer: Report offline hrtimer enqueue btrfs: forbid creating subvol qgroups btrfs: send: return EOPNOTSUPP on unknown flags spi: ppc4xx: Drop write-only variable ASoC: rt5645: Fix deadlock in rt5645_jack_detect_work() Documentation: net-sysfs: describe missing statistics net: sysfs: Fix /sys/class/net/<iface> path for statistics MIPS: Add 'memory' clobber to csum_ipv6_magic() inline assembler i40e: Fix waiting for queues of all VSIs to be disabled tracing/trigger: Fix to return error if failed to alloc snapshot mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again HID: wacom: generic: Avoid reporting a serial of '0' to userspace HID: wacom: Do not register input devices until after hid_hw_start USB: hub: check for alternate port before enabling A_ALT_HNP_SUPPORT usb: f_mass_storage: forbid async queue when shutdown happen scsi: Revert "scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock" firewire: core: correct documentation of fw_csr_string() kernel API nfc: nci: free rx_data_reassembly skb on NCI device cleanup xen-netback: properly sync TX responses binder: signal epoll threads of self-work ext4: fix double-free of blocks due to wrong extents moved_len staging: iio: ad5933: fix type mismatch regression ring-buffer: Clean ring_buffer_poll_wait() error return serial: max310x: set default value when reading clock ready bit serial: max310x: improve crystal stable clock detection x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6 x86/mm/ident_map: Use gbpages only where full GB page should be mapped. ALSA: hda/conexant: Add quirk for SWS JS201D nilfs2: fix data corruption in dsync block recovery for small block sizes nilfs2: fix hang in nilfs_lookup_dirty_data_buffers() nfp: use correct macro for LengthSelect in BAR config irqchip/irq-brcmstb-l2: Add write memory barrier before exit pmdomain: core: Move the unused cleanup to a _sync initcall Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d" sched/membarrier: reduce the ability to hammer on sys_membarrier nilfs2: fix potential bug in end_buffer_async_write lsm: new security_file_ioctl_compat() hook netfilter: nf_tables: fix pointer math issue in nft_byteorder_eval() Linux 4.19.307 Change-Id: Ib05aec445afe9920e2502bcfce1c52db76e27139 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2024-04-15 10:17:13 +00:00
Ming Lei	9525b38180	blk-mq: fix IO hang from sbitmap wakeup race [ Upstream commit 5266caaf5660529e3da53004b8b7174cab6374ed ] In blk_mq_mark_tag_wait(), __add_wait_queue() may be re-ordered with the following blk_mq_get_driver_tag() in case of getting driver tag failure. Then in __sbitmap_queue_wake_up(), waitqueue_active() may not observe the added waiter in blk_mq_mark_tag_wait() and wake up nothing, meantime blk_mq_mark_tag_wait() can't get driver tag successfully. This issue can be reproduced by running the following test in loop, and fio hang can be observed in < 30min when running it on my test VM in laptop. modprobe -r scsi_debug modprobe scsi_debug delay=0 dev_size_mb=4096 max_queue=1 host_max_queue=1 submit_queues=4 dev=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter/host/target//block/* \| head -1 \| xargs basename` fio --filename=/dev/"$dev" --direct=1 --rw=randrw --bs=4k --iodepth=1 \ --runtime=100 --numjobs=40 --time_based --name=test \ --ioengine=libaio Fix the issue by adding one explicit barrier in blk_mq_mark_tag_wait(), which is just fine in case of running out of tag. Cc: Jan Kara <jack@suse.cz> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Changhui Zhong <czhong@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240112122626.4181044-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-02-23 08:12:51 +01:00
Ming Lei	f198e0a1af	BACKPORT: blk-mq: clear stale request in tags->rq[] before freeing one request pool refcount_inc_not_zero() in bt_tags_iter() still may read one freed request. Fix the issue by the following approach: 1) hold a per-tags spinlock when reading ->rqs[tag] and calling refcount_inc_not_zero in bt_tags_iter() 2) clearing stale request referred via ->rqs[tag] before freeing request pool, the per-tags spinlock is held for clearing stale ->rq[tag] So after we cleared stale requests, bt_tags_iter() won't observe freed request any more, also the clearing will wait for pending request reference. The idea of clearing ->rqs[] is borrowed from John Garry's previous patch and one recent David's patch. Tested-by: John Garry <john.garry@huawei.com> Reviewed-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Bug: 197804811 Change-Id: If49478d7b05d3f5b0a26966ddf9ae764cf2fb6b0 (cherry picked from commit bd63141d585bef14f4caf111f6d0e27fe2300ec6) [ refactored to avoid breaking KMI ] Signed-off-by: Pradeep P V K <pragalla@codeaurora.org> Signed-off-by: Todd Kjos <tkjos@google.com> (cherry picked from commit bb96e7f45dc6ac1d6ec12190f1f286e3014fb068) Signed-off-by: Lee Jones <joneslee@google.com>	2023-04-03 10:25:13 +00:00
Bart Van Assche	00b5bc6421	blk-mq: Swap two calls in blk_mq_exit_queue() [ Upstream commit 630ef623ed26c18a457cdc070cf24014e50129c2 ] If a tag set is shared across request queues (e.g. SCSI LUNs) then the block layer core keeps track of the number of active request queues in tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is cleared by blk_mq_del_queue_tag_set(). Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Fixes: `0d2602ca30` ("blk-mq: improve support for shared tags maps") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210513171529.7977-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-05-22 10:59:46 +02:00
Ming Lei	6ff18507a7	blk-mq: don't hold q->sysfs_lock in blk_mq_map_swqueue commit c6ba933358f0d7a6a042b894dba20cc70396a6d3 upstream. blk_mq_map_swqueue() is called from blk_mq_init_allocated_queue() and blk_mq_update_nr_hw_queues(). For the former caller, the kobject isn't exposed to userspace yet. For the latter caller, hctx sysfs entries and debugfs are un-registered before updating nr_hw_queues. On the other hand, commit 2f8f1336a48b ("blk-mq: always free hctx after request queue is freed") moves freeing hctx into queue's release handler, so there won't be race with queue release path too. So don't hold q->sysfs_lock in blk_mq_map_swqueue(). Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-02-13 13:51:15 +01:00
Johannes Thumshirn	7894076fbc	block: factor out requeue handling from dispatch code [ Upstream commit c92a41031a6d57395889b5c87cea359220a24d2a ] Factor out the requeue handling from the dispatch code, this will make subsequent addition of different requeueing schemes easier. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2020-12-30 11:25:45 +01:00
Ming Lei	77064570e4	blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART commit d7d8535f377e9ba87edbf7fbbd634ac942f3f54f upstream. SCHED_RESTART code path is relied to re-run queue for dispatch requests in hctx->dispatch. Meantime the SCHED_RSTART flag is checked when adding requests to hctx->dispatch. memory barriers have to be used for ordering the following two pair of OPs: 1) adding requests to hctx->dispatch and checking SCHED_RESTART in blk_mq_dispatch_rq_list() 2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch in blk_mq_sched_restart(). Without the added memory barrier, either: 1) blk_mq_sched_restart() may miss requests added to hctx->dispatch meantime blk_mq_dispatch_rq_list() observes SCHED_RESTART, and not run queue in dispatch side or 2) blk_mq_dispatch_rq_list still sees SCHED_RESTART, and not run queue in dispatch side, meantime checking if there is request in hctx->dispatch from blk_mq_sched_restart() is missed. IO hang in ltp/fs_fill test is reported by kernel test robot: https://lkml.org/lkml/2020/7/26/77 Turns out it is caused by the above out-of-order OPs. And the IO hang can't be observed any more after applying this patch. Fixes: `bd166ef183` ("blk-mq-sched: add framework for MQ capable IO schedulers") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Jeffery <djeffery@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2020-09-03 11:24:26 +02:00
Yufen Yu	82652c06f9	block: fix null pointer dereference in blk_mq_rq_timed_out() commit 8d6996630c03d7ceeabe2611378fea5ca1c3f1b3 upstream. We got a null pointer deference BUG_ON in blk_mq_rq_timed_out() as following: [ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040 [ 108.827059] PGD 0 P4D 0 [ 108.827313] Oops: 0000 [#1] SMP PTI [ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431 [ 108.829503] Workqueue: kblockd blk_mq_timeout_work [ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330 [ 108.838191] Call Trace: [ 108.838406] bt_iter+0x74/0x80 [ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450 [ 108.839074] ? __switch_to_asm+0x34/0x70 [ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40 [ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40 [ 108.840273] ? syscall_return_via_sysret+0xf/0x7f [ 108.840732] blk_mq_timeout_work+0x74/0x200 [ 108.841151] process_one_work+0x297/0x680 [ 108.841550] worker_thread+0x29c/0x6f0 [ 108.841926] ? rescuer_thread+0x580/0x580 [ 108.842344] kthread+0x16a/0x1a0 [ 108.842666] ? kthread_flush_work+0x170/0x170 [ 108.843100] ret_from_fork+0x35/0x40 The bug is caused by the race between timeout handle and completion for flush request. When timeout handle function blk_mq_rq_timed_out() try to read 'req->q->mq_ops', the 'req' have completed and reinitiated by next flush request, which would call blk_rq_init() to clear 'req' as 0. After commit `12f5b93145` ("blk-mq: Remove generation seqeunce"), normal requests lifetime are protected by refcount. Until 'rq->ref' drop to zero, the request can really be free. Thus, these requests cannot been reused before timeout handle finish. However, flush request has defined .end_io and rq->end_io() is still called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq' can be reused by the next flush request handle, resulting in null pointer deference BUG ON. We fix this problem by covering flush request with 'rq->ref'. If the refcount is not zero, flush_end_io() return and wait the last holder recall it. To record the request status, we add a new entry 'rq_status', which will be used in flush_end_io(). Cc: Christoph Hellwig <hch@infradead.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: stable@vger.kernel.org # v4.18+ Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> ------- v2: - move rq_status from struct request to struct blk_flush_queue v3: - remove unnecessary '{}' pair. v4: - let spinlock to protect 'fq->rq_status' v5: - move rq_status after flush_running_idx member of struct blk_flush_queue Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-05 13:10:08 +02:00
zhengbin	40cdc71e11	blk-mq: move cancel of requeue_work to the front of blk_exit_queue [ Upstream commit e26cc08265dda37d2acc8394604f220ef412299d ] blk_exit_queue will free elevator_data, while blk_mq_requeue_work will access it. Move cancel of requeue_work to the front of blk_exit_queue to avoid use-after-free. blk_exit_queue blk_mq_requeue_work __elevator_exit blk_mq_run_hw_queues blk_mq_exit_sched blk_mq_run_hw_queue dd_exit_queue blk_mq_hctx_has_pending kfree(elevator_data) blk_mq_sched_has_work dd_has_work Fixes: fbc2a15e3433 ("blk-mq: move cancel of requeue_work into blk_mq_release") Cc: stable@vger.kernel.org Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-10-01 08:26:10 +02:00
Jianchao Wang	313efb253d	blk-mq: change gfp flags to GFP_NOIO in blk_mq_realloc_hw_ctxs [ Upstream commit 5b202853ffbc54b29f23c4b1b5f3948efab489a2 ] blk_mq_realloc_hw_ctxs could be invoked during update hw queues. At the momemt, IO is blocked. Change the gfp flags from GFP_KERNEL to GFP_NOIO to avoid forever hang during memory allocation in blk_mq_realloc_hw_ctxs. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-10-01 08:26:10 +02:00
Ming Lei	e238e6dc22	blk-mq: free hw queue's resource in hctx's release handler [ Upstream commit c7e2d94b3d1634988a95ac4d77a72dc7487ece06 ] Once blk_cleanup_queue() returns, tags shouldn't be used any more, because blk_mq_free_tag_set() may be called. Commit `45a9c9d909` ("blk-mq: Fix a use-after-free") fixes this issue exactly. However, that commit introduces another issue. Before `45a9c9d909`, we are allowed to run queue during cleaning up queue if the queue's kobj refcount is held. After that commit, queue can't be run during queue cleaning up, otherwise oops can be triggered easily because some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue(). We have invented ways for addressing this kind of issue before, such as: 8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done") `c2856ae2f3` ("blk-mq: quiesce queue before freeing queue") But still can't cover all cases, recently James reports another such kind of issue: https://marc.info/?l=linux-scsi&m=155389088124782&w=2 This issue can be quite hard to address by previous way, given scsi_run_queue() may run requeues for other LUNs. Fixes the above issue by freeing hctx's resources in its release handler, and this way is safe becasue tags isn't needed for freeing such hctx resource. This approach follows typical design pattern wrt. kobject's release handler. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen <martin.petersen@oracle.com>, Cc: Christoph Hellwig <hch@lst.de>, Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>, Reported-by: James Smart <james.smart@broadcom.com> Fixes: `45a9c9d909` ("blk-mq: Fix a use-after-free") Cc: stable@vger.kernel.org Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-09-16 08:22:13 +02:00
Ming Lei	525b5265fd	blk-mq: move cancel of requeue_work into blk_mq_release [ Upstream commit fbc2a15e3433058582e5635aabe48a3011a644a8 ] With holding queue's kobject refcount, it is safe for driver to schedule requeue. However, blk_mq_kick_requeue_list() may be called after blk_sync_queue() is done because of concurrent requeue activities, then requeue work may not be completed when freeing queue, and kernel oops is triggered. So moving the cancel of requeue_work into blk_mq_release() for avoiding race between requeue and freeing queue. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen <martin.petersen@oracle.com>, Cc: Christoph Hellwig <hch@lst.de>, Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>, Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-06-15 11:54:06 +02:00
Jens Axboe	824c212908	bfq: update internal depth state when queue depth changes commit 77f1e0a52d26242b6c2dba019f6ebebfb9ff701e upstream A previous commit moved the shallow depth and BFQ depth map calculations to be done at init time, moving it outside of the hotter IO path. This potentially causes hangs if the users changes the depth of the scheduler map, by writing to the 'nr_requests' sysfs file for that device. Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if the depth changes, so that the scheduler can update its internal state. Signed-off-by: Eric Wheeler <bfq@linux.ewheeler.net> Tested-by: Kai Krakow <kai@kaishome.de> Reported-by: Paolo Valente <paolo.valente@linaro.org> Fixes: `f0635b8a41` ("bfq: calculate shallow depths at init time") Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-05-16 19:41:17 +02:00
Shenghui Wang	e5be04ee17	block: use blk_free_flush_queue() to free hctx->fq in blk_mq_init_hctx [ Upstream commit b9a1ff504b9492ad6beb7d5606e0e3365d4d8499 ] kfree() can leak the hctx->fq->flush_rq field. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-05-08 07:21:49 +02:00
Jianchao Wang	29452f665c	blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue [ Upstream commit aef1897cd36dcf5e296f1d2bae7e0d268561b685 ] When requeue, if RQF_DONTPREP, rq has contained some driver specific data, so insert it to hctx dispatch list to avoid any merge. Take scsi as example, here is the trace event log (no io scheduler, because RQF_STARTED would prevent merging), kworker/0:1H-339 [000] ...1 2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H] scsi_inert_test-1987 [000] .... 2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test] scsi_inert_test-1987 [000] ...2 2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test] kworker/0:1H-339 [000] .... 2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H] scsi_inert_test-1996 [000] ..s1 2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0] scsi_inert_test-1996 [000] .Ns1 2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0] kworker/0:1H-339 [000] ...1 2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H] kworker/0:1H-339 [000] ...1 2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H] scsi_inert_test-1986 [000] ..s1 2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0] (32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP. Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP, the sdb only contained the part of (32768 + 8), then only that part was completed. The lucky thing was that scsi_io_completion detected it and requeued the remaining part. So we didn't get corrupted data. However, the requeue of (32776 + 8) is not expected. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-23 20:09:45 +01:00
Jens Axboe	55cbeea76e	blk-mq: punt failed direct issue to dispatch list commit c616cbee97aed4bc6178f148a7240206dcdb85a6 upstream. After the direct dispatch corruption fix, we permanently disallow direct dispatch of non read/write requests. This works fine off the normal IO path, as they will be retried like any other failed direct dispatch request. But for the blk_insert_cloned_request() that only DM uses to bypass the bottom level scheduler, we always first attempt direct dispatch. For some types of requests, that's now a permanent failure, and no amount of retrying will make that succeed. This results in a livelock. Instead of making special cases for what we can direct issue, and now having to deal with DM solving the livelock while still retaining a BUSY condition feedback loop, always just add a request that has been through ->queue_rq() to the hardware queue dispatch list. These are safe to use as no merging can take place there. Additionally, if requests do have prepped data from drivers, we aren't dependent on them not sharing space in the request structure to safely add them to the IO scheduler lists. This basically reverts ffe81d45322c and is based on a patch from Ming, but with the list insert case covered as well. Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") Cc: stable@vger.kernel.org Suggested-by: Ming Lei <ming.lei@redhat.com> Reported-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Ming Lei <ming.lei@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-12-08 12:59:10 +01:00
Jens Axboe	724ff9cbfe	blk-mq: fix corruption with direct issue commit ffe81d45322cc3cb140f0db080a4727ea284661e upstream. If we attempt a direct issue to a SCSI device, and it returns BUSY, then we queue the request up normally. However, the SCSI layer may have already setup SG tables etc for this particular command. If we later merge with this request, then the old tables are no longer valid. Once we issue the IO, we only read/write the original part of the request, not the new state of it. This causes data corruption, and is most often noticed with the file system complaining about the just read data being invalid: [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) because most of it is garbage... This doesn't happen from the normal issue path, as we will simply defer the request to the hardware queue dispatch list if we fail. Once it's on the dispatch list, we never merge with it. Fix this from the direct issue path by flagging the request as REQ_NOMERGE so we don't change the size of it before issue. See also: https://bugzilla.kernel.org/show_bug.cgi?id=201685 Tested-by: Guenter Roeck <linux@roeck-us.net> Fixes: `6ce3dd6eec` ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-12-08 12:59:06 +01:00
Ilya Dryomov	587562d0c7	blk-mq: I/O and timer unplugs are inverted in blktrace trace_block_unplug() takes true for explicit unplugs and false for implicit unplugs. schedule() unplugs are implicit and should be reported as timer unplugs. While correct in the legacy code, this has been inverted in blk-mq since 4.11. Cc: stable@vger.kernel.org Fixes: `bd166ef183` ("blk-mq-sched: add framework for MQ capable IO schedulers") Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-09-27 13:12:44 -06:00
Linus Torvalds	5bed49adfe	Merge tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: - Set of bcache fixes and changes (Coly) - The flush warn fix (me) - Small series of BFQ fixes (Paolo) - wbt hang fix (Ming) - blktrace fix (Steven) - blk-mq hardware queue count update fix (Jianchao) - Various little fixes * tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits) block/DAC960.c: make some arrays static const, shrinks object size blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter blk-mq: init hctx sched after update ctx and hctx mapping block: remove duplicate initialization tracing/blktrace: Fix to allow setting same value pktcdvd: fix setting of 'ret' error return for a few cases block: change return type to bool block, bfq: return nbytes and not zero from struct cftype .write() method block, bfq: improve code of bfq_bfqq_charge_time block, bfq: reduce write overcharge block, bfq: always update the budget of an entity when needed block, bfq: readd missing reset of parent-entity service blk-wbt: fix IO hang in wbt_wait() block: don't warn for flush on read-only device bcache: add the missing comments for smp_mb()/smp_wmb() bcache: remove unnecessary space before ioctl function pointer arguments bcache: add missing SPDX header bcache: move open brace at end of function definitions to next line bcache: add static const prefix to char * array declarations bcache: fix code comments style ...	2018-08-22 13:38:05 -07:00
Jianchao Wang	f5bbbbe4d6	blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to account the inflight requests. It will access the queue_hw_ctx and nr_hw_queues w/o any protection. When updating nr_hw_queues and blk_mq_in_flight/rw occur concurrently, panic comes up. Before update nr_hw_queues, the q will be frozen. So we could use q_usage_counter to avoid the race. percpu_ref_is_zero is used here so that we will not miss any in-flight request. The access to nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are under rcu critical section, __blk_mq_update_nr_hw_queues could use synchronize_rcu to ensure the zeroed q_usage_counter to be globally visible. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-21 09:02:56 -06:00
Jianchao Wang	d48ece209f	blk-mq: init hctx sched after update ctx and hctx mapping Currently, when update nr_hw_queues, IO scheduler's init_hctx will be invoked before the mapping between ctx and hctx is adapted correctly by blk_mq_map_swqueue. The IO scheduler init_hctx (kyber) may depend on this mapping and get wrong result and panic finally. A simply way to fix this is that switch the IO scheduler to 'none' before update the nr_hw_queues, and then switch it back after update nr_hw_queues. blk_mq_sched_init_/exit_hctx are removed due to nobody use them any more. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-21 09:02:55 -06:00
Linus Torvalds	73ba2fb33c	Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "First pull request for this merge window, there will also be a followup request with some stragglers. This pull request contains: - Fix for a thundering heard issue in the wbt block code (Anchal Agarwal) - A few NVMe pull requests: * Improved tracepoints (Keith) * Larger inline data support for RDMA (Steve Wise) * RDMA setup/teardown fixes (Sagi) * Effects log suppor for NVMe target (Chaitanya Kulkarni) * Buffered IO suppor for NVMe target (Chaitanya Kulkarni) * TP4004 (ANA) support (Christoph) * Various NVMe fixes - Block io-latency controller support. Much needed support for properly containing block devices. (Josef) - Series improving how we handle sense information on the stack (Kees) - Lightnvm fixes and updates/improvements (Mathias/Javier et al) - Zoned device support for null_blk (Matias) - AIX partition fixes (Mauricio Faria de Oliveira) - DIF checksum code made generic (Max Gurtovoy) - Add support for discard in iostats (Michael Callahan / Tejun) - Set of updates for BFQ (Paolo) - Removal of async write support for bsg (Christoph) - Bio page dirtying and clone fixups (Christoph) - Set of bcache fix/changes (via Coly) - Series improving blk-mq queue setup/teardown speed (Ming) - Series improving merging performance on blk-mq (Ming) - Lots of other fixes and cleanups from a slew of folks" * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits) blkcg: Make blkg_root_lookup() work for queues in bypass mode bcache: fix error setting writeback_rate through sysfs interface null_blk: add lock drop/acquire annotation Blk-throttle: reduce tail io latency when iops limit is enforced block: paride: pd: mark expected switch fall-throughs block: Ensure that a request queue is dissociated from the cgroup controller block: Introduce blk_exit_queue() blkcg: Introduce blkg_root_lookup() block: Remove two superfluous #include directives blk-mq: count the hctx as active before allocating tag block: bvec_nr_vecs() returns value for wrong slab bcache: trivial - remove tailing backslash in macro BTREE_FLAG bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section bcache: set max writeback rate when I/O request is idle bcache: add code comments for bset.c bcache: fix mistaken comments in request.c bcache: fix mistaken code comments in bcache.h bcache: add a comment in super.c bcache: avoid unncessary cache prefetch bch_btree_node_get() bcache: display rate debug parameters to 0 when writeback is not running ...	2018-08-14 10:23:25 -07:00
Jianchao Wang	d263ed9926	blk-mq: count the hctx as active before allocating tag Currently, we count the hctx as active after allocate driver tag successfully. If a previously inactive hctx try to get tag first time, it may fails and need to wait. However, due to the stale tag ->active_queues, the other shared-tags users are still able to occupy all driver tags while there is someone waiting for tag. Consequently, even if the previously inactive hctx is waked up, it still may not be able to get a tag and could be starved. To fix it, we count the hctx as active before try to allocate driver tag, then when it is waiting the tag, the other shared-tag users will reserve budget for it. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-09 08:34:17 -06:00
Linus Torvalds	eb181a814c	Merge tag 'for-linus-20180727' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Bigger than usual at this time, mostly due to the O_DIRECT corruption issue and the fact that I was on vacation last week. This contains: - NVMe pull request with two fixes for the FC code, and two target fixes (Christoph) - a DIF bio reset iteration fix (Greg Edwards) - two nbd reply and requeue fixes (Josef) - SCSI timeout fixup (Keith) - a small series that fixes an issue with bio_iov_iter_get_pages(), which ended up causing corruption for larger sized O_DIRECT writes that ended up racing with buffered writes (Martin Wilck)" * tag 'for-linus-20180727' of git://git.kernel.dk/linux-block: block: reset bi_iter.bi_done after splitting bio block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs blkdev: __blkdev_direct_IO_simple: fix leak in error case block: bio_iov_iter_get_pages: fix size of last iovec nvmet: only check for filebacking on -ENOTBLK nvmet: fixup crash on NULL device path scsi: set timed out out mq requests to complete blk-mq: export setting request completion state nvme: if_ready checks to fail io to deleting controller nvmet-fc: fix target sgl list on large transfers nbd: handle unexpected replies better nbd: don't requeue the same request twice.	2018-07-27 12:51:00 -07:00
Keith Busch	0fc09f9209	blk-mq: export setting request completion state This is preparing for drivers that want to directly alter the state of their requests. No functional change here. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-24 14:41:50 -06:00
Ming Lei	8824f62246	blk-mq: fail the request in case issue failure Inside blk_mq_try_issue_list_directly(), if the request is issued as failed, we shouldn't try to do it again, otherwise the warning in blk_mq_start_request() will be triggered. This change is aligned to behaviour of other ways of request issue & dispatch. Fixes: `6ce3dd6eec` ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Laurence Oberman <loberman@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: kernel test robot <rong.a.chen@intel.com> Cc: LKP <lkp@01.org> Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-22 17:31:18 -06:00
Ming Lei	6ce3dd6eec	blk-mq: issue directly if hw queue isn't busy in case of 'none' In case of 'none' io scheduler, when hw queue isn't busy, it isn't necessary to enqueue request to sw queue and dequeue it from sw queue because request may be submitted to hw queue asap without extra cost, meantime there shouldn't be much request in sw queue, and we don't need to worry about effect on IO merge. There are still some single hw queue SCSI HBAs(HPSA, megaraid_sas, ...) which may connect high performance devices, so 'none' is often required for obtaining good performance. This patch improves IOPS and decreases CPU unilization on megaraid_sas, per Kashyap's test. Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Laurence Oberman <loberman@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Reported-by: Kashyap Desai <kashyap.desai@broadcom.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-17 16:04:00 -06:00
Josef Bacik	c1c80384c8	block: remove external dependency on wbt_flags We don't really need to save this stuff in the core block code, we can just pass the bio back into the helpers later on to derive the same flags and update the rq->wbt_flags appropriately. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-09 09:07:54 -06:00
Josef Bacik	a79050434b	blk-rq-qos: refactor out common elements of blk-wbt blkcg-qos is going to do essentially what wbt does, only on a cgroup basis. Break out the common code that will be shared between blkcg-qos and wbt into blk-rq-qos.* so they can both utilize the same infrastructure. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-09 09:07:54 -06:00
Ming Lei	6e76871730	blk-mq: dequeue request one by one from sw queue if hctx is busy It won't be efficient to dequeue request one by one from sw queue, but we have to do that when queue is busy for better merge performance. This patch takes the Exponential Weighted Moving Average(EWMA) to figure out if queue is busy, then only dequeue request one by one from sw queue when queue is busy. Fixes: `b347689ffb` ("blk-mq-sched: improve dispatching from sw queue") Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Laurence Oberman <loberman@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Reported-by: Kashyap Desai <kashyap.desai@broadcom.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-09 09:07:53 -06:00
Ming Lei	3f0cedc7e9	blk-mq: use list_splice_tail_init() to insert requests list_splice_tail_init() is much more faster than inserting each request one by one, given all requets in 'list' belong to same sw queue and ctx->lock is required to insert requests. Cc: Laurence Oberman <loberman@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-09 09:07:53 -06:00
Minwoo Im	c018c84fdb	blk-mq: fix typo in a function comment Fix typo in a function blk_mq_alloc_tag_set() comment. if if it too large -> if it's too large. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-09 09:07:53 -06:00
Minwoo Im	0da73d00ca	blk-mq: code clean-up by adding an API to clear set->mq_map set->mq_map is now currently cleared if something goes wrong when establishing a queue map in blk-mq-pci.c. It's also cleared before updating a queue map in blk_mq_update_queue_map(). This patch provides an API to clear set->mq_map to make it clear. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-09 09:07:53 -06:00
Ming Lei	97889f9ac2	blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set() We have to remove synchronize_rcu() from blk_queue_cleanup(), otherwise long delay can be caused during lun probe. For removing it, we have to avoid to iterate the set->tag_list in IO path, eg, blk_mq_sched_restart(). This patch reverts 5b79413946d (Revert "blk-mq: don't handle TAG_SHARED in restart"). Given we have fixed enough IO hang issue, and there isn't any reason to restart all queues in one tags any more, see the following reasons: 1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(), which can wake up queues waiting for driver tag. 2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue, target or host is ready, but SCSI built-in restart can cover all these well, see scsi_end_request(), queue will be rerun after any request initiated from this host/target is completed. In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30% when running I/O on these 8 luns concurrently. Fixes: `705cda97ee` ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list") Cc: Omar Sandoval <osandov@fb.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: linux-scsi@vger.kernel.org Reported-by: Andrew Jones <drjones@redhat.com> Tested-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-09 09:07:52 -06:00
Ming Lei	5815839b3c	blk-mq: introduce new lock for protecting hctx->dispatch_wait Now hctx->lock is only acquired when adding hctx->dispatch_wait to one wait queue, but not held when removing it from the wait queue. IO hang can be observed easily if SCHED RESTART is disabled, that means now RESTART exits just for fixing the issue in blk_mq_mark_tag_wait(). This patch fixes the issue by introducing hctx->dispatch_wait_lock and holding it for removing hctx->dispatch_wait in blk_mq_dispatch_wake(), since we need to avoid acquiring hctx->lock in irq context. Fixes: `eb619fdb2d` ("blk-mq: fix issue with shared tag queue re-running") Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-09 09:07:52 -06:00
Ming Lei	2278d69f03	blk-mq: don't pass hctx to blk_mq_mark_tag_wait() 'hctx' won't be changed at all, so not necessary to pass 'hctx' to blk_mq_mark_tag_wait(). Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-09 09:07:52 -06:00
Ming Lei	8ab6bb9ee8	blk-mq: cleanup blk_mq_get_driver_tag() We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence we never change '**hctx' as well. The last use of these went away with the flush cleanup, commit `0c2a6fe4dc`. So cleanup the usage and remove the two extra parameters. Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Tested-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-07-09 09:07:52 -06:00
Linus Torvalds	e6e5bec43c	Merge tag 'for-linus-20180629' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Small set of fixes for this series. Mostly just minor fixes, the only oddball in here is the sg change. The sg change came out of the stall fix for NVMe, where we added a mempool and limited us to a single page allocation. CONFIG_SG_DEBUG sort-of ruins that, since we'd need to account for that. That's actually a generic problem, since lots of drivers need to allocate SG lists. So this just removes support for CONFIG_SG_DEBUG, which I added back in 2007 and to my knowledge it was never useful. Anyway, outside of that, this pull contains: - clone of request with special payload fix (Bart) - drbd discard handling fix (Bart) - SATA blk-mq stall fix (me) - chunk size fix (Keith) - double free nvme rdma fix (Sagi)" * tag 'for-linus-20180629' of git://git.kernel.dk/linux-block: sg: remove ->sg_magic member drbd: Fix drbd_request_prepare() discard handling blk-mq: don't queue more if we get a busy return block: Fix cloning of requests with a special payload nvme-rdma: fix possible double free of controller async event buffer block: Fix transfer when chunk sectors exceeds max	2018-06-30 10:47:46 -07:00
Jens Axboe	1f57f8d442	blk-mq: don't queue more if we get a busy return Some devices have different queue limits depending on the type of IO. A classic case is SATA NCQ, where some commands can queue, but others cannot. If we have NCQ commands inflight and encounter a non-queueable command, the driver returns busy. Currently we attempt to dispatch more from the scheduler, if we were able to queue some commands. But for the case where we ended up stopping due to BUSY, we should not attempt to retrieve more from the scheduler. If we do, we can get into a situation where we attempt to queue a non-queueable command, get BUSY, then successfully retrieve more commands from that scheduler and queue those. This can repeat forever, starving the non-queuable command indefinitely. Fix this by NOT attempting to pull more commands from the scheduler, if we get a BUSY return. This should also be more optimal in terms of letting requests stay in the scheduler for as long as possible, if we get a BUSY due to the regular out-of-tags condition. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-06-29 07:52:31 -06:00
Linus Torvalds	77072ca59f	Merge tag 'for-linus-20180623' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Further timeout fixes. We aren't quite there yet, so expect another round of fixes for that to completely close some of the IRQ vs completion races. (Christoph/Bart) - Set of NVMe fixes from the usual suspects, mostly error handling - Two off-by-one fixes (Dan) - Another bdi race fix (Jan) - Fix nbd reconfigure with NBD_DISCONNECT_ON_CLOSE (Doron) * tag 'for-linus-20180623' of git://git.kernel.dk/linux-block: blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE bdi: Fix another oops in wb_workfn() lightnvm: Remove depends on HAS_DMA in case of platform dependency nvme-pci: limit max IO size and segments to avoid high order allocations nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrl nvme-fc: release io queues to allow fast fail nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag. block: sed-opal: Fix a couple off by one bugs blk-mq-debugfs: Off by one in blk_mq_rq_state_name() nvmet: reset keep alive timer in controller enable nvme-rdma: don't override opts->queue_size nvme-rdma: Fix command completion race at error recovery nvme-rdma: fix possible free of a non-allocated async event buffer nvme-rdma: fix possible double free condition when failing to create a controller Revert "block: Add warning for bi_next not NULL in bio_endio()" block: fix timeout changes for legacy request drivers	2018-06-24 06:33:54 +08:00
Bart Van Assche	f5e350f021	blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE Make sure that RQF_TIMED_OUT is cleared when a request is reused after a block driver timeout handler has returned BLK_EH_DONE. Fixes: `da66126739` ("blk-mq: don't time out requests again that are in the timeout handler") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Andrew Randrianasulu <randrianasulu@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-06-23 10:25:45 -06:00
Linus Torvalds	265c5596da	Merge tag 'for-linus-20180616' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A collection of fixes that should go into -rc1. This contains: - bsg_open vs bsg_unregister race fix (Anatoliy) - NVMe pull request from Christoph, with fixes for regressions in this window, FC connect/reconnect path code unification, and a trace point addition. - timeout fix (Christoph) - remove a few unused functions (Christoph) - blk-mq tag_set reinit fix (Roman)" * tag 'for-linus-20180616' of git://git.kernel.dk/linux-block: bsg: fix race of bsg_open and bsg_unregister block: remov blk_queue_invalidate_tags nvme-fabrics: fix and refine state checks in __nvmf_check_ready nvme-fabrics: handle the admin-only case properly in nvmf_check_ready nvme-fabrics: refactor queue ready check blk-mq: remove blk_mq_tagset_iter nvme: remove nvme_reinit_tagset nvme-fc: fix nulling of queue data on reconnect nvme-fc: remove reinit_request routine blk-mq: don't time out requests again that are in the timeout handler nvme-fc: change controllers first connect to use reconnect path nvme: don't rely on the changed namespace list log nvmet: free smart-log buffer after use nvme-rdma: fix error flow during mapping request data nvme: add bio remapping tracepoint nvme: fix NULL pointer dereference in nvme_init_subsystem blk-mq: reinit q->tag_set_list entry only after grace period	2018-06-17 05:37:55 +09:00
Christoph Hellwig	da66126739	blk-mq: don't time out requests again that are in the timeout handler We can currently call the timeout handler again on a request that has already been handed over to the timeout handler. Prevent that with a new flag. Fixes: `12f5b931` ("blk-mq: Remove generation seqeunce") Reported-by: Andrew Randrianasulu <randrianasulu@gmail.com> Tested-by: Andrew Randrianasulu <randrianasulu@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-06-14 08:39:46 -06:00
Kees Cook	590b5b7d86	treewide: kzalloc_node() -> kcalloc_node() The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This patch replaces cases of: kzalloc_node(a * b, gfp, node) with: kcalloc_node(a * b, gfp, node) as well as handling cases of: kzalloc_node(a * b * c, gfp, node) with: kzalloc_node(array3_size(a, b, c), gfp, node) as it's slightly less ugly than: kcalloc_node(array_size(a, b), c, gfp, node) This does, however, attempt to ignore constant size factors like: kzalloc_node(4 * 1024, gfp, node) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc_node( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) \| kzalloc_node( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc_node( - sizeof(u8) * (COUNT) + COUNT , ...) \| kzalloc_node( - sizeof(__u8) * (COUNT) + COUNT , ...) \| kzalloc_node( - sizeof(char) * (COUNT) + COUNT , ...) \| kzalloc_node( - sizeof(unsigned char) * (COUNT) + COUNT , ...) \| kzalloc_node( - sizeof(u8) * COUNT + COUNT , ...) \| kzalloc_node( - sizeof(__u8) * COUNT + COUNT , ...) \| kzalloc_node( - sizeof(char) * COUNT + COUNT , ...) \| kzalloc_node( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc_node + kcalloc_node ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc_node( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc_node( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc_node( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc_node( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) \| kzalloc_node( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kzalloc_node( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kzalloc_node( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) \| kzalloc_node( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc_node( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kzalloc_node( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) \| kzalloc_node( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kzalloc_node( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) \| kzalloc_node( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) \| kzalloc_node( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc_node( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) \| kzalloc_node( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc_node(C1 * C2 * C3, ...) \| kzalloc_node( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) \| kzalloc_node( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) \| kzalloc_node( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) \| kzalloc_node( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc_node(sizeof(THING) * C2, ...) \| kzalloc_node(sizeof(TYPE) * C2, ...) \| kzalloc_node(C1 * C2 * C3, ...) \| kzalloc_node(C1 * C2, ...) \| - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) \| - kzalloc_node + kcalloc_node ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) \| - kzalloc_node + kcalloc_node ( - (E1) * E2 + E1, E2 , ...) \| - kzalloc_node + kcalloc_node ( - (E1) * (E2) + E1, E2 , ...) \| - kzalloc_node + kcalloc_node ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>	2018-06-12 16:19:22 -07:00
Roman Pen	a347c7ad8e	blk-mq: reinit q->tag_set_list entry only after grace period It is not allowed to reinit q->tag_set_list list entry while RCU grace period has not completed yet, otherwise the following soft lockup in blk_mq_sched_restart() happens: [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270] [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000 [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150 [ 1064.256510] Call Trace: [ 1064.256664] <IRQ> [ 1064.256824] blk_mq_free_request+0xea/0x100 [ 1064.256987] msg_io_conf+0x59/0xd0 [ibnbd_client] [ 1064.257175] complete_rdma_req+0xf2/0x230 [ibtrs_client] [ 1064.257340] ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core] [ 1064.257502] ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client] [ 1064.257669] ib_create_qp+0x321/0x380 [ib_core] [ 1064.257841] ib_process_cq_direct+0xbd/0x120 [ib_core] [ 1064.258007] irq_poll_softirq+0xb7/0xe0 [ 1064.258165] __do_softirq+0x106/0x2a2 [ 1064.258328] irq_exit+0x92/0xa0 [ 1064.258509] do_IRQ+0x4a/0xd0 [ 1064.258660] common_interrupt+0x7a/0x7a [ 1064.258818] </IRQ> Meanwhile another context frees other queue but with the same set of shared tags: [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds. [ 1288.201833] bash D 0 5910 5820 0x00000000 [ 1288.202016] Call Trace: [ 1288.202315] schedule+0x32/0x80 [ 1288.202462] schedule_timeout+0x1e5/0x380 [ 1288.203838] wait_for_completion+0xb0/0x120 [ 1288.204137] __wait_rcu_gp+0x125/0x160 [ 1288.204287] synchronize_sched+0x6e/0x80 [ 1288.204770] blk_mq_free_queue+0x74/0xe0 [ 1288.204922] blk_cleanup_queue+0xc7/0x110 [ 1288.205073] ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client] [ 1288.205389] ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client] [ 1288.205548] kernfs_fop_write+0x109/0x180 [ 1288.206328] vfs_write+0xb3/0x1a0 [ 1288.206476] SyS_write+0x52/0xc0 [ 1288.206624] do_syscall_64+0x68/0x1d0 [ 1288.206774] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 What happened is the following: 1. There are several MQ queues with shared tags. 2. One queue is about to be freed and now task is in blk_mq_del_queue_tag_set(). 3. Other CPU is in blk_mq_sched_restart() and loops over all queues in tag list in order to find hctx to restart. Because linked list entry was modified in blk_mq_del_queue_tag_set() without proper waiting for a grace period, blk_mq_sched_restart() never ends, spining in list_for_each_entry_rcu_rr(), thus soft lockup. Fix is simple: reinit list entry after an RCU grace period elapsed. Fixes: Fixes: `705cda97ee` ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list") Cc: stable@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-block@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-06-11 08:13:41 -06:00
Jianchao Wang	0196d6b408	blk-mq: return when hctx is stopped in blk_mq_run_work_fn If a hardware queue is stopped, it should not be run again before explicitly started. Ignore stopped queues in blk_mq_run_work_fn(), fixing a regression recently introduced when the START_ON_RUN bit was removed. Fixes: `15fe8a90bb` ("blk-mq: remove blk_mq_delay_queue()") Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-06-04 12:10:40 -06:00
Christoph Hellwig	131d08e122	block: split the blk-mq case from elevator_init There is almost no shared logic, which leads to a very confusing code flow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Tested-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-06-01 07:38:21 -06:00
Christoph Hellwig	acddf3b308	block: move sysfs_lock into elevator_init Both callers take just around so function call, so move it in. Also remove the now pointless blk_mq_sched_init wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Tested-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-06-01 07:38:19 -06:00

1 2 3 4 5 ...

620 Commits