kernel_xiaomi_sm8250

Author	SHA1	Message	Date
Michael Bestas	1b59618ce4	Merge tag 'ASB-2023-09-05_4.19-stable' of https://android.googlesource.com/kernel/common into android13-4.19-kona https://source.android.com/docs/security/bulletin/2023-09-01 * tag 'ASB-2023-09-05_4.19-stable' of https://android.googlesource.com/kernel/common: Linux 4.19.294 Revert "ARM: ep93xx: fix missing-prototype warnings" Revert "MIPS: Alchemy: fix dbdma2" Linux 4.19.293 dma-buf/sw_sync: Avoid recursive lock during fence signal clk: Fix undefined reference to `clk_rate_exclusive_{get,put}' scsi: core: raid_class: Remove raid_component_add() scsi: snic: Fix double free in snic_tgt_create() irqchip/mips-gic: Don't touch vl_map if a local interrupt is not routable rtnetlink: Reject negative ifindexes in RTM_NEWLINK netfilter: nf_queue: fix socket leak sched/rt: pick_next_rt_entity(): check list_entry mmc: block: Fix in_flight[issue_type] value error x86/fpu: Set X86_FEATURE_OSXSAVE feature after enabling OSXSAVE in CR4 PCI: acpiphp: Use pci_assign_unassigned_bridge_resources() only for non-root bus media: vcodec: Fix potential array out-of-bounds in encoder queue_setup lib/clz_ctz.c: Fix __clzdi2() and __ctzdi2() for 32-bit kernels batman-adv: Fix batadv_v_ogm_aggr_send memory leak batman-adv: Fix TT global entry leak when client roamed back batman-adv: Do not get eth header before batadv_check_management_packet batman-adv: Don't increase MTU when set by user batman-adv: Trigger events for auto adjusted MTU nfsd: Fix race to FREE_STATEID and cl_revoked ibmveth: Use dcbf rather than dcbfl ipvs: fix racy memcpy in proc_do_sync_threshold ipvs: Improve robustness to the ipvs sysctl bonding: fix macvlan over alb bond support net: remove bond_slave_has_mac_rcu() net/sched: fix a qdisc modification with ambiguous command request igb: Avoid starting unnecessary workqueues dccp: annotate data-races in dccp_poll() sock: annotate data-races around prot->memory_pressure tracing: Fix memleak due to race between current_tracer and trace drm/amd/display: check TG is non-null before checking if enabled drm/amd/display: do not wait for mpc idle if tg is disabled regmap: Account for register length in SMBus I/O limits dm integrity: reduce vmalloc space footprint on 32-bit architectures dm integrity: increase RECALC_SECTORS to improve recalculate speed powerpc: Fail build if using recordmcount with binutils v2.37 powerpc: remove leftover code of old GCC version checks powerpc/32: add stack protector support fbdev: fix potential OOB read in fast_imageblit() fbdev: Fix sys_imageblit() for arbitrary image widths fbdev: Improve performance of sys_imageblit() tty: serial: fsl_lpuart: add earlycon for imx8ulp platform Revert "tty: serial: fsl_lpuart: drop earlycon entry for i.MX8QXP" MIPS: cpu-features: Use boot_cpu_type for CPU type based features MIPS: cpu-features: Enable octeon_cache by cpu_type fs: dlm: fix mismatch of plock results from userspace fs: dlm: use dlm_plock_info for do_unlock_close fs: dlm: change plock interrupted message to debug again fs: dlm: add pid to debug log dlm: replace usage of found with dedicated list iterator variable dlm: improve plock logging if interrupted PCI: acpiphp: Reassign resources on bridge if necessary net: phy: broadcom: stub c45 read/write for 54810 net: xfrm: Amend XFRMA_SEC_CTX nla_policy structure net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled virtio-net: set queues after driver_ok af_unix: Fix null-ptr-deref in unix_stream_sendpage(). netfilter: set default timeout to 3 secs for sctp shutdown send and recv state test_firmware: prevent race conditions by a correct implementation of locking mmc: wbsd: fix double mmc_free_host() in wbsd_init() cifs: Release folio lock on fscache read hit. ALSA: usb-audio: Add support for Mythware XA001AU capture and playback interfaces. serial: 8250: Fix oops for port->pm on uart_change_pm() ASoC: meson: axg-tdm-formatter: fix channel slot allocation ASoC: rt5665: add missed regulator_bulk_disable net: do not allow gso_size to be set to GSO_BY_FRAGS sock: Fix misuse of sk_under_memory_pressure() i40e: fix misleading debug logs team: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves netfilter: nft_dynset: disallow object maps selftests: mirror_gre_changes: Tighten up the TTL test match xfrm: add NULL check in xfrm_update_ae_params ip_vti: fix potential slab-use-after-free in decode_session6 ip6_vti: fix slab-use-after-free in decode_session6 xfrm: fix slab-use-after-free in decode_session6 xfrm: interface: rename xfrm_interface.c to xfrm_interface_core.c net: af_key: fix sadb_x_filter validation net: xfrm: Fix xfrm_address_filter OOB read btrfs: fix BUG_ON condition in btrfs_cancel_balance powerpc/rtas_flash: allow user copy to flash block cache objects fbdev: mmp: fix value check in mmphw_probe() virtio-mmio: don't break lifecycle of vm_dev virtio-mmio: Use to_virtio_mmio_device() to simply code virtio-mmio: convert to devm_platform_ioremap_resource nfsd: Remove incorrect check in nfsd4_validate_stateid nfsd4: kill warnings on testing stateids with mismatched clientids block: fix signed int overflow in Amiga partition support mmc: sunxi: fix deferred probing mmc: bcm2835: fix deferred probing mmc: Remove dev_err() usage after platform_get_irq() mmc: tmio: move tmio_mmc_set_clock() to platform hook mmc: tmio: replace tmio_mmc_clk_stop() calls with tmio_mmc_set_clock() mmc: meson-gx: remove redundant mmc_request_done() call from irq context mmc: meson-gx: remove useless lock USB: dwc3: qcom: fix NULL-deref on suspend usb: dwc3: qcom: Add helper functions to enable,disable wake irqs irqchip/mips-gic: Use raw spinlock for gic_lock irqchip/mips-gic: Get rid of the reliance on irq_cpu_online() x86/topology: Fix erroneous smp_num_siblings on Intel Hybrid platforms powerpc/64s/radix: Fix soft dirty tracking powerpc: Move page table dump files in a dedicated subdirectory powerpc/mm: dump block address translation on book3s/32 powerpc/mm: dump segment registers on book3s/32 powerpc/mm: Move pgtable_t into platform headers powerpc/mm: move platform specific mmu-xxx.h in platform directories iio: addac: stx104: Fix race condition when converting analog-to-digital iio: addac: stx104: Fix race condition for stx104_write_raw() iio: adc: stx104: Implement and utilize register structures iio: adc: stx104: Utilize iomap interface iio: add addac subdirectory IMA: allow/fix UML builds drm/amdgpu: Fix potential fence use-after-free v2 Bluetooth: L2CAP: Fix use-after-free pcmcia: rsrc_nonstatic: Fix memory leak in nonstatic_release_resource_db() gfs2: Fix possible data races in gfs2_show_options() media: platform: mediatek: vpu: fix NULL ptr dereference media: v4l2-mem2mem: add lock to protect parameter num_rdy FS: JFS: Check for read-only mounted filesystem in txBegin FS: JFS: Fix null-ptr-deref Read in txBegin MIPS: dec: prom: Address -Warray-bounds warning fs: jfs: Fix UBSAN: array-index-out-of-bounds in dbAllocDmapLev udf: Fix uninitialized array access for some pathnames HID: add quirk for 03f0:464a HP Elite Presenter Mouse quota: fix warning in dqgrab() quota: Properly disable quotas when add_dquot_ref() fails ALSA: emu10k1: roll up loops in DSP setup code for Audigy drm/radeon: Fix integer overflow in radeon_cs_parser_init selftests: forwarding: tc_flower: Relax success criterion lib/mpi: Eliminate unused umul_ppmm definitions for MIPS Revert "posix-timers: Ensure timer ID search-loop limit is valid" UPSTREAM: media: usb: siano: Fix warning due to null work_func_t function pointer UPSTREAM: Bluetooth: L2CAP: Fix use-after-free in l2cap_sock_ready_cb UPSTREAM: net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free UPSTREAM: net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free Linux 4.19.292 sch_netem: fix issues in netem_change() vs get_dist_table() alpha: remove __init annotation from exported page_is_ram() scsi: core: Fix possible memory leak if device_add() fails scsi: snic: Fix possible memory leak if device_add() fails scsi: 53c700: Check that command slot is not NULL scsi: storvsc: Fix handling of virtual Fibre Channel timeouts scsi: core: Fix legacy /proc parsing buffer overflow netfilter: nf_tables: report use refcount overflow netfilter: nf_tables: bogus EBUSY when deleting flowtable after flush btrfs: don't stop integrity writeback too early ibmvnic: Handle DMA unmapping of login buffs in release functions wifi: cfg80211: fix sband iftype data lookup for AP_VLAN IB/hfi1: Fix possible panic during hotplug remove drivers: net: prevent tun_build_skb() to exceed the packet size limit dccp: fix data-race around dp->dccps_mss_cache bonding: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves net/packet: annotate data-races around tp->status mISDN: Update parameter type of dsp_cmx_send() drm/nouveau/disp: Revert a NULL check inside nouveau_connector_get_modes x86: Move gds_ucode_mitigated() declaration to header x86/mm: Fix VDSO and VVAR placement on 5-level paging machines x86/cpu/amd: Enable Zenbleed fix for AMD Custom APU 0405 usb: dwc3: Properly handle processing of pending events usb-storage: alauda: Fix uninit-value in alauda_check_media() binder: fix memory leak in binder_init() iio: cros_ec: Fix the allocation size for cros_ec_command nilfs2: fix use-after-free of nilfs_root in dirtying inodes via iput radix tree test suite: fix incorrect allocation size for pthreads drm/nouveau/gr: enable memory loads on helper invocation on all channels dmaengine: pl330: Return DMA_PAUSED when transaction is paused ipv6: adjust ndisc_is_useropt() to also return true for PIO mmc: moxart: read scr register without changing byte order sparc: fix up arch_cpu_finalize_init() build breakage. UPSTREAM: net/sched: cls_fw: Fix improper refcount update leads to use-after-free Linux 4.19.291 drm/edid: fix objtool warning in drm_cvt_modes() arm64: dts: stratix10: fix incorrect I2C property for SCL signal drivers core: Use sysfs_emit and sysfs_emit_at for show(device ...) functions ARM: dts: nxp/imx6sll: fix wrong property name in usbphy node ARM: dts: imx6sll: fixup of operating points ARM: dts: imx: add usb alias ARM: dts: imx6sll: Make ssi node name same as other platforms PM: sleep: wakeirq: fix wake irq arming PM / wakeirq: support enabling wake-up irq after runtime_suspend called powerpc/mm/altmap: Fix altmap boundary check mtd: rawnand: omap_elm: Fix incorrect type in assignment test_firmware: return ENOMEM instead of ENOSPC on failed memory allocation test_firmware: fix a memory leak with reqs buffer ext2: Drop fragment support net: usbnet: Fix WARNING in usbnet_start_xmit/usb_submit_urb Bluetooth: L2CAP: Fix use-after-free in l2cap_sock_ready_cb fs/sysv: Null check to prevent null-ptr-deref bug USB: zaurus: Add ID for A-300/B-500/C-700 libceph: fix potential hang in ceph_osdc_notify() scsi: zfcp: Defer fc_rport blocking until after ADISC response tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen tcp_metrics: annotate data-races around tm->tcpm_net tcp_metrics: annotate data-races around tm->tcpm_vals[] tcp_metrics: annotate data-races around tm->tcpm_lock tcp_metrics: annotate data-races around tm->tcpm_stamp tcp_metrics: fix addr_same() helper ip6mr: Fix skb_under_panic in ip6mr_cache_report() net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free net: add missing data-race annotation for sk_ll_usec net: add missing data-race annotations around sk->sk_peek_off net: sched: cls_u32: Fix match key mis-addressing perf test uprobe_from_different_cu: Skip if there is no gcc net/mlx5e: fix return value check in mlx5e_ipsec_remove_trailer() KVM: s390: fix sthyi error handling word-at-a-time: use the same return type for has_zero regardless of endianness loop: Select I/O scheduler 'none' from inside add_disk() perf: Fix function pointer case net/sched: cls_u32: Fix reference counter leak leading to overflow ASoC: cs42l51: fix driver to properly autoload with automatic module loading net/sched: sch_qfq: account for stab overhead in qfq_enqueue net/sched: cls_fw: Fix improper refcount update leads to use-after-free drm/client: Fix memory leak in drm_client_target_cloned dm cache policy smq: ensure IO doesn't prevent cleaner policy progress ASoC: wm8904: Fill the cache for WM8904_ADC_TEST_0 register s390/dasd: fix hanging device after quiesce/resume virtio-net: fix race between set queues and probe serial: 8250_dw: Preserve original value of DLF register serial: 8250_dw: split Synopsys DesignWare 8250 common functions irq-bcm6345-l1: Do not assume a fixed block to cpu mapping tpm_tis: Explicitly check for error code btrfs: check for commit error at btrfs_attach_transaction_barrier() hwmon: (nct7802) Fix for temp6 (PECI1) processed even if PECI1 disabled staging: ks7010: potential buffer overflow in ks_wlan_set_encode_ext() Documentation: security-bugs.rst: clarify CVE handling Documentation: security-bugs.rst: update preferences when dealing with the linux-distros group usb: xhci-mtk: set the dma max_seg_size USB: quirks: add quirk for Focusrite Scarlett usb: ohci-at91: Fix the unhandle interrupt when resume usb: dwc3: don't reset device side if dwc3 was configured as host-only usb: dwc3: pci: skip BYT GPIO lookup table for hardwired phy Revert "usb: dwc3: core: Enable AutoRetry feature in the controller" can: gs_usb: gs_can_close(): add missing set of CAN state to CAN_STATE_STOPPED USB: serial: simple: sort driver entries USB: serial: simple: add Kaufmann RKS+CAN VCP USB: serial: option: add Quectel EC200A module support USB: serial: option: support Quectel EM060K_128 tracing: Fix warning in trace_buffered_event_disable() ring-buffer: Fix wrong stat of cpu_buffer->read ata: pata_ns87415: mark ns87560_tf_read static dm raid: fix missing reconfig_mutex unlock in raid_ctr() error paths block: Fix a source code comment in include/uapi/linux/blkzoned.h ASoC: fsl_spdif: Silence output on stop drm/msm: Fix IS_ERR_OR_NULL() vs NULL check in a5xx_submit_in_rb() RDMA/mlx4: Make check for invalid flags stricter benet: fix return value check in be_lancer_xmit_workarounds() net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64 net/sched: mqprio: add extack to mqprio_parse_nlattr() net/sched: mqprio: refactor nlattr parsing to a separate function platform/x86: msi-laptop: Fix rfkill out-of-sync on MSI Wind U100 team: reset team's flags when down link is P2P device bonding: reset bond's flags when down link is P2P device tcp: Reduce chance of collisions in inet6_hashfn(). ipv6 addrconf: fix bug where deleting a mngtmpaddr can create a new temporary address ethernet: atheros: fix return value check in atl1e_tso_csum() phy: hisilicon: Fix an out of bounds check in hisi_inno_phy_probe() i40e: Fix an NULL vs IS_ERR() bug for debugfs_create_dir() ext4: fix to check return value of freeze_bdev() in ext4_shutdown() scsi: qla2xxx: Array index may go out of bound scsi: qla2xxx: Fix inconsistent format argument type in qla_os.c ftrace: Fix possible warning on checking all pages used in ftrace_process_locs() ftrace: Store the order of pages allocated in ftrace_page ftrace: Check if pages were allocated before calling free_pages() ftrace: Add information on number of page groups allocated fs: dlm: interrupt posix locks only when process is killed dlm: rearrange async condition return dlm: cleanup plock_op vs plock_xop PCI/ASPM: Avoid link retraining race PCI/ASPM: Factor out pcie_wait_for_retrain() PCI/ASPM: Return 0 or -ETIMEDOUT from pcie_retrain_link() PCI: Rework pcie_retrain_link() wait loop ext4: Fix reusing stale buffer heads from last failed mounting ext4: rename journal_dev to s_journal_dev inside ext4_sb_info btrfs: fix extent buffer leak after tree mod log failure at split_node() bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent bcache: remove 'int n' from parameter list of bch_bucket_alloc_set() bcache: use MAX_CACHES_PER_SET instead of magic number 8 in __bch_bucket_alloc_set gpio: tps68470: Make tps68470_gpio_output() always set the initial value tracing/histograms: Return an error if we fail to add histogram to hist_vars list tcp: annotate data-races around fastopenq.max_qlen tcp: annotate data-races around tp->notsent_lowat tcp: annotate data-races around rskq_defer_accept tcp: annotate data-races around tp->linger2 net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX netfilter: nf_tables: can't schedule in nft_chain_validate netfilter: nf_tables: fix spurious set element insertion failure llc: Don't drop packet from non-root netns. fbdev: au1200fb: Fix missing IRQ check in au1200fb_drv_probe Revert "tcp: avoid the lookup process failing to get sk in ehash table" net:ipv6: check return value of pskb_trim() net: ethernet: ti: cpsw_ale: Fix cpsw_ale_get_field()/cpsw_ale_set_field() pinctrl: amd: Use amd_pinconf_set() for all config options fbdev: imxfb: warn about invalid left/right margin spi: bcm63xx: fix max prepend length igb: Fix igb_down hung on surprise removal wifi: iwlwifi: mvm: avoid baid size integer overflow wifi: wext-core: Fix -Wstringop-overflow warning in ioctl_standard_iw_point() bpf: Address KCSAN report on bpf_lru_list sched/fair: Don't balance task to its current running CPU posix-timers: Ensure timer ID search-loop limit is valid md/raid10: prevent soft lockup while flush writes md: fix data corruption for raid456 when reshape restart while grow up nbd: Add the maximum limit of allocated index in nbd_dev_add debugobjects: Recheck debug_objects_enabled before reporting ext4: correct inline offset when handling xattrs in inode body can: bcm: Fix UAF in bcm_proc_show() fuse: revalidate: don't invalidate if interrupted perf probe: Add test for regression introduced by switch to die_get_decl_file() tracing/histograms: Add histograms to hist_vars if they have referenced variables drm/atomic: Fix potential use-after-free in nonblocking commits scsi: qla2xxx: Pointer may be dereferenced scsi: qla2xxx: Check valid rport returned by fc_bsg_to_rport() scsi: qla2xxx: Fix potential NULL pointer dereference scsi: qla2xxx: Wait for io return on terminate rport xtensa: ISS: fix call to split_if_spec ring-buffer: Fix deadloop issue on reading trace_pipe tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() when iterating clk tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() in case of error Revert "8250: add support for ASIX devices with a FIFO bug" meson saradc: fix clock divider mask length ceph: don't let check_caps skip sending responses for revoke msgs hwrng: imx-rngc - fix the timeout for init and self check serial: atmel: don't enable IRQs prematurely fs: dlm: return positive pid value for F_GETLK md/raid0: add discard support for the 'original' layout misc: pci_endpoint_test: Re-init completion for every test misc: pci_endpoint_test: Free IRQs before removing the device PCI: rockchip: Use u32 variable to access 32-bit registers PCI: rockchip: Fix legacy IRQ generation for RK3399 PCIe endpoint core PCI: rockchip: Add poll and timeout to wait for PHY PLLs to be locked PCI: rockchip: Write PCI Device ID to correct register PCI: rockchip: Assert PCI Configuration Enable bit after probe PCI: qcom: Disable write access to read only registers for IP v2.3.3 PCI: Add function 1 DMA alias quirk for Marvell 88SE9235 PCI/PM: Avoid putting EloPOS E2/S2/H2 PCIe Ports in D3cold jfs: jfs_dmap: Validate db_l2nbperpage while mounting ext4: only update i_reserved_data_blocks on successful block allocation ext4: fix wrong unit use in ext4_mb_clear_bb perf intel-pt: Fix CYC timestamps after standalone CBR SUNRPC: Fix UAF in svc_tcp_listen_data_ready() net: bcmgenet: Ensure MDIO unregistration has clocks enabled tpm: tpm_vtpm_proxy: fix a race condition in /dev/vtpmx creation pinctrl: amd: Only use special debounce behavior for GPIO 0 pinctrl: amd: Detect internal GPIO0 debounce handling pinctrl: amd: Fix mistake in handling clearing pins at startup net/sched: make psched_mtu() RTNL-less safe wifi: airo: avoid uninitialized warning in airo_get_rate() ipv6/addrconf: fix a potential refcount underflow for idev NTB: ntb_tool: Add check for devm_kcalloc NTB: ntb_transport: fix possible memory leak while device_register() fails ntb: intel: Fix error handling in intel_ntb_pci_driver_init() NTB: amd: Fix error handling in amd_ntb_pci_driver_init() ntb: idt: Fix error handling in idt_pci_driver_init() udp6: fix udp6_ehashfn() typo icmp6: Fix null-ptr-deref of ip6_null_entry->rt6i_idev in icmp6_dev(). vrf: Increment Icmp6InMsgs on the original netdev net: mvneta: fix txq_map in case of txq_number==1 workqueue: clean up WORK_ constant types, clarify masking net: lan743x: Don't sleep in atomic context netfilter: nf_tables: prevent OOB access in nft_byteorder_eval netfilter: conntrack: Avoid nf_ct_helper_hash uses after free netfilter: nf_tables: fix scheduling-while-atomic splat netfilter: nf_tables: unbind non-anonymous set if rule construction fails netfilter: nf_tables: reject unbound anonymous set before commit phase netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE netfilter: nf_tables: use net_generic infra for transaction data netfilter: add helper function to set up the nfnetlink header and use it netfilter: nftables: add helper function to set the base sequence number netfilter: nf_tables: add rescheduling points during loop detection walks netfilter: nf_tables: fix nat hook table deletion spi: spi-fsl-spi: allow changing bits_per_word while CS is still active spi: spi-fsl-spi: relax message sanity checking a little spi: spi-fsl-spi: remove always-true conditional in fsl_spi_do_one_msg ARM: orion5x: fix d2net gpio initialization btrfs: fix race when deleting quota root from the dirty cow roots list jffs2: reduce stack usage in jffs2_build_xattr_subsystem() integrity: Fix possible multiple allocation in integrity_inode_get() bcache: Remove unnecessary NULL point check in node allocations mmc: core: disable TRIM on Micron MTFC4GACAJCN-1M mmc: core: disable TRIM on Kingston EMMC04G-M627 NFSD: add encoding of op_recall flag for write delegation ALSA: jack: Fix mutex call in snd_jack_report() i2c: xiic: Don't try to handle more interrupt events after error i2c: xiic: Defer xiic_wakeup() and __xiic_start_xfer() in xiic_process() sh: dma: Fix DMA channel offset calculation net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX tcp: annotate data races in __tcp_oow_rate_limited() net: bridge: keep ports without IFF_UNICAST_FLT in BR_PROMISC mode powerpc: allow PPC_EARLY_DEBUG_CPM only when SERIAL_CPM=y f2fs: fix error path handling in truncate_dnode() mailbox: ti-msgmgr: Fill non-message tx data fields with 0x0 spi: bcm-qspi: return error if neither hif_mspi nor mspi is available Add MODULE_FIRMWARE() for FIRMWARE_TG357766. sctp: fix potential deadlock on &net->sctp.addr_wq_lock rtc: st-lpc: Release some resources in st_rtc_probe() in case of error mfd: stmpe: Only disable the regulators if they are enabled mfd: intel-lpss: Add missing check for platform_get_resource KVM: s390: fix KVM_S390_GET_CMMA_BITS for GFNs in memslot holes mfd: rt5033: Drop rt5033-battery sub-device usb: phy: phy-tahvo: fix memory leak in tahvo_usb_probe() extcon: Fix kernel doc of property capability fields to avoid warnings extcon: Fix kernel doc of property fields to avoid warnings media: usb: siano: Fix warning due to null work_func_t function pointer media: videodev2.h: Fix struct v4l2_input tuner index comment media: usb: Check az6007_read() return value sh: j2: Use ioremap() to translate device tree address into kernel memory w1: fix loop in w1_fini() block: change all __u32 annotations to __be32 in affs_hardblocks.h USB: serial: option: add LARA-R6 01B PIDs ARC: define ASM_NL and __ALIGN(_STR) outside #ifdef __ASSEMBLY__ guard ARCv2: entry: rewrite to enable use of double load/stores LDD/STD ARCv2: entry: avoid a branch ARCv2: entry: push out the Z flag unclobber from common EXCEPTION_PROLOGUE ARCv2: entry: comments about hardware auto-save on taken interrupts modpost: fix section mismatch message for R_ARM_{PC24,CALL,JUMP24} modpost: fix section mismatch message for R_ARM_ABS32 crypto: nx - fix build warnings when DEBUG_FS is not enabled hwrng: virtio - Fix race on data_avail and actual data hwrng: virtio - always add a pending request hwrng: virtio - don't waste entropy hwrng: virtio - don't wait on cleanup hwrng: virtio - add an internal buffer pinctrl: at91-pio4: check return value of devm_kasprintf() perf dwarf-aux: Fix off-by-one in die_get_varname() pinctrl: cherryview: Return correct value if pin in push-pull mode PCI: Add pci_clear_master() stub for non-CONFIG_PCI scsi: 3w-xxxx: Add error handling for initialization failure in tw_probe() ALSA: ac97: Fix possible NULL dereference in snd_ac97_mixer drm/radeon: fix possible division-by-zero errors fbdev: omapfb: lcd_mipid: Fix an error handling path in mipid_spi_probe() arm64: dts: renesas: ulcb-kf: Remove flow control for SCIF1 IB/hfi1: Fix sdma.h tx->num_descs off-by-one errors soc/fsl/qe: fix usb.c build errors ASoC: es8316: Increment max value for ALC Capture Target Volume control ARM: ep93xx: fix missing-prototype warnings drm/panel: simple: fix active size for Ampire AM-480272H3TMQW-T01H Input: adxl34x - do not hardcode interrupt trigger type ARM: dts: BCM5301X: Drop "clock-names" from the SPI node Input: drv260x - sleep between polling GO bit radeon: avoid double free in ci_dpm_init() netlink: Add __sock_i_ino() for __netlink_diag_dump(). ipvlan: Fix return value of ipvlan_queue_xmit() netfilter: nf_conntrack_sip: fix the ct_sip_parse_numerical_param() return value. lib/ts_bm: reset initial match offset for every block of text gtp: Fix use-after-free in __gtp_encap_destroy(). netlink: do not hard code device address lenth in fdb dumps netlink: fix potential deadlock in netlink_set_err() wifi: ath9k: convert msecs to jiffies where needed wifi: ath9k: Fix possible stall on ath9k_txq_list_has_key() memstick r592: make memstick_debug_get_tpc_name() static kexec: fix a memory leak in crash_shrink_memory() watchdog/perf: more properly prevent false positives with turbo modes watchdog/perf: define dummy watchdog_update_hrtimer_threshold() on correct config wifi: rsi: Do not set MMC_PM_KEEP_POWER in shutdown wifi: ath9k: don't allow to overwrite ENDPOINT0 attributes wifi: ray_cs: Fix an error handling path in ray_probe() wifi: ray_cs: Drop useless status variable in parse_addr() wifi: ray_cs: Utilize strnlen() in parse_addr() wifi: wl3501_cs: Fix an error handling path in wl3501_probe() wl3501_cs: use eth_hw_addr_set() net: create netdev->dev_addr assignment helpers wl3501_cs: Fix misspelling and provide missing documentation wl3501_cs: Remove unnecessary NULL check wl3501_cs: Fix a bunch of formatting issues related to function docs wifi: atmel: Fix an error handling path in atmel_probe() wifi: orinoco: Fix an error handling path in orinoco_cs_probe() wifi: orinoco: Fix an error handling path in spectrum_cs_probe() nfc: llcp: fix possible use of uninitialized variable in nfc_llcp_send_connect() nfc: constify several pointers to u8, char and sk_buff wifi: mwifiex: Fix the size of a memory allocation in mwifiex_ret_802_11_scan() samples/bpf: Fix buffer overflow in tcp_basertt wifi: ath9k: avoid referencing uninit memory in ath9k_wmi_ctrl_rx wifi: ath9k: fix AR9003 mac hardware hang check register offset calculation evm: Complete description of evm_inode_setattr() ARM: 9303/1: kprobes: avoid missing-declaration warnings PM: domains: fix integer overflow issues in genpd_parse_state() clocksource/drivers/cadence-ttc: Fix memory leak in ttc_timer_probe clocksource/drivers/cadence-ttc: Use ttc driver as platform driver clocksource/drivers: Unify the names to timer-* format irqchip/jcore-aic: Fix missing allocation of IRQ descriptors irqchip/jcore-aic: Kill use of irq_create_strict_mappings() md/raid10: fix io loss while replacement replace rdev md/raid10: fix wrong setting of max_corr_read_errors md/raid10: fix overflow of md/safe_mode_delay md/raid10: check slab-out-of-bounds in md_bitmap_get_counter treewide: Remove uninitialized_var() usage drm/amdgpu: Validate VM ioctl flags. scripts/tags.sh: Resolve gtags empty index generation drm/edid: Fix uninitialized variable in drm_cvt_modes() fbdev: imsttfb: Fix use after free bug in imsttfb_probe video: imsttfb: check for ioremap() failures x86/smp: Use dedicated cache-line for mwait_play_dead() gfs2: Don't deref jdesc in evict Linux 4.19.290 x86: fix backwards merge of GDS/SRSO bit xen/netback: Fix buffer overrun triggered by unusual packet Documentation/x86: Fix backwards on/off logic about YMM support x86/xen: Fix secondary processors' FPU initialization KVM: Add GDS_NO support to KVM x86/speculation: Add Kconfig option for GDS x86/speculation: Add force option to GDS mitigation x86/speculation: Add Gather Data Sampling mitigation x86/fpu: Move FPU initialization into arch_cpu_finalize_init() x86/fpu: Mark init functions __init x86/fpu: Remove cpuinfo argument from init functions init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init() init: Invoke arch_cpu_finalize_init() earlier init: Remove check_bugs() leftovers um/cpu: Switch to arch_cpu_finalize_init() sparc/cpu: Switch to arch_cpu_finalize_init() sh/cpu: Switch to arch_cpu_finalize_init() mips/cpu: Switch to arch_cpu_finalize_init() m68k/cpu: Switch to arch_cpu_finalize_init() ia64/cpu: Switch to arch_cpu_finalize_init() ARM: cpu: Switch to arch_cpu_finalize_init() x86/cpu: Switch to arch_cpu_finalize_init() init: Provide arch_cpu_finalize_init() Conflicts: drivers/mmc/core/block.c drivers/mmc/host/sdhci-msm.c drivers/usb/dwc3/core.c drivers/usb/dwc3/gadget.c Change-Id: Id2f4d5c8067f8e5eda39c0eaa5e59d54a394b4c7	2023-09-19 18:11:03 +03:00
Kees Cook	b7e389235c	treewide: Remove uninitialized_var() usage commit 3f649ab728cda8038259d8f14492fe400fbab911 upstream. Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' \| cut -d: -f1 \| sort -u \| \ xargs perl -pi -e \ 's/\buninitialized_var$([^$]+)\)/\1/g; s:\s/\ (GCC be quiet\|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-11 11:45:01 +02:00
Ivaylo Georgiev	5996b2fe7b	Merge android-4.19.63 (`75ff56e`) into msm-4.19 * refs/heads/tmp-75ff56e: Linux 4.19.63 access: avoid the RCU grace period for the temporary subjective credentials libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl() powerpc/tm: Fix oops on sigreturn on systems without TM powerpc/xive: Fix loop exit-condition in xive_find_target_in_mask() ALSA: hda - Add a conexant codec entry to let mute led work ALSA: line6: Fix wrong altsetting for LINE6_PODHD500_1 ALSA: ac97: Fix double free of ac97_codec_device hpet: Fix division by zero in hpet_time_div() mei: me: add mule creek canyon (EHL) device ids fpga-manager: altera-ps-spi: Fix build error binder: prevent transactions to context manager from its own process. x86/speculation/mds: Apply more accurate check on hypervisor platform x86/sysfb_efi: Add quirks for some devices with swapped width and height btrfs: inode: Don't compress if NODATASUM or NODATACOW set usb: pci-quirks: Correct AMD PLL quirk detection usb: wusbcore: fix unbalanced get/put cluster_id locking/lockdep: Hide unused 'class' variable mm: use down_read_killable for locking mmap_sem in access_remote_vm locking/lockdep: Fix lock used or unused stats error proc: use down_read_killable mmap_sem for /proc/pid/maps cxgb4: reduce kernel stack usage in cudbg_collect_mem_region() proc: use down_read_killable mmap_sem for /proc/pid/map_files proc: use down_read_killable mmap_sem for /proc/pid/clear_refs proc: use down_read_killable mmap_sem for /proc/pid/pagemap proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup mm/mmu_notifier: use hlist_add_head_rcu() memcg, fsnotify: no oom-kill for remote memcg charging mm/gup.c: remove some BUG_ONs from get_gate_page() mm/gup.c: mark undo_dev_pagemap as __maybe_unused 9p: pass the correct prototype to read_cache_page mm/kmemleak.c: fix check for softirq context sh: prevent warnings when using iounmap block/bio-integrity: fix a memory leak bug powerpc/eeh: Handle hugepages in ioremap space dlm: check if workqueues are NULL before flushing/destroying mailbox: handle failed named mailbox channel request f2fs: avoid out-of-range memory access block: init flush rq ref count to 1 powerpc/boot: add {get, put}_unaligned_be32 to xz_config.h PCI: dwc: pci-dra7xx: Fix compilation when !CONFIG_GPIOLIB RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM perf hists browser: Fix potential NULL pointer dereference found by the smatch tool perf annotate: Fix dereferencing freed memory found by the smatch tool perf session: Fix potential NULL pointer dereference found by the smatch tool perf top: Fix potential NULL pointer dereference detected by the smatch tool perf stat: Fix use-after-freed pointer detected by the smatch tool perf test mmap-thread-lookup: Initialize variable to suppress memory sanitizer warning PCI: mobiveil: Use the 1st inbound window for MEM inbound transactions PCI: mobiveil: Initialize Primary/Secondary/Subordinate bus numbers kallsyms: exclude kasan local symbols on s390 PCI: mobiveil: Fix the Class Code field PCI: mobiveil: Fix PCI base address in MEM/IO outbound windows arm64: assembler: Switch ESB-instruction with a vanilla nop if !ARM64_HAS_RAS IB/ipoib: Add child to parent list only if device initialized powerpc/mm: Handle page table allocation failures IB/mlx5: Fixed reporting counters on 2nd port for Dual port RoCE serial: sh-sci: Fix TX DMA buffer flushing and workqueue races serial: sh-sci: Terminate TX DMA during buffer flushing RDMA/i40iw: Set queue pair state when being queried powerpc/4xx/uic: clear pending interrupt after irq type/pol change um: Silence lockdep complaint about mmap_sem mm/swap: fix release_pages() when releasing devmap pages mfd: hi655x-pmic: Fix missing return value check for devm_regmap_init_mmio_clk mfd: arizona: Fix undefined behavior mfd: core: Set fwnode for created devices mfd: madera: Add missing of table registration recordmcount: Fix spurious mcount entries on powerpc powerpc/xmon: Fix disabling tracing while in xmon powerpc/cacheflush: fix variable set but not used iio: iio-utils: Fix possible incorrect mask calculation PCI: xilinx-nwl: Fix Multi MSI data programming genksyms: Teach parser about 128-bit built-in types kbuild: Add -Werror=unknown-warning-option to CLANG_FLAGS i2c: stm32f7: fix the get_irq error cases PCI: sysfs: Ignore lockdep for remove attribute serial: mctrl_gpio: Check if GPIO property exisits before requesting it drm/msm: Depopulate platform on probe failure powerpc/pci/of: Fix OF flags parsing for 64bit BARs mmc: sdhci: sdhci-pci-o2micro: Check if controller supports 8-bit width usb: gadget: Zero ffs_io_data tty: serial_core: Set port active bit in uart_port_activate serial: imx: fix locking in set_termios() drm/rockchip: Properly adjust to a true clock in adjusted_mode powerpc/pseries/mobility: prevent cpu hotplug during DT update drm/amd/display: fix compilation error phy: renesas: rcar-gen2: Fix memory leak at error paths drm/virtio: Add memory barriers for capset cache. drm/amd/display: Always allocate initial connector state state serial: 8250: Fix TX interrupt handling condition tty: serial: msm_serial: avoid system lockup condition tty/serial: digicolor: Fix digicolor-usart already registered warning memstick: Fix error cleanup path of memstick_init drm/crc-debugfs: Also sprinkle irqrestore over early exits drm/crc-debugfs: User irqsafe spinlock in drm_crtc_add_crc_entry gpu: host1x: Increase maximum DMA segment size drm/bridge: sii902x: pixel clock unit is 10kHz instead of 1kHz drm/bridge: tc358767: read display_props in get_modes() PCI: Return error if cannot probe VF drm/edid: Fix a missing-check bug in drm_load_edid_firmware() drm/amdkfd: Fix sdma queue map issue drm/amdkfd: Fix a potential memory leak drm/amd/display: Disable ABM before destroy ABM struct drm/amdgpu/sriov: Need to initialize the HDP_NONSURFACE_BAStE drm/amd/display: Fill prescale_params->scale for RGB565 tty: serial: cpm_uart - fix init when SMC is relocated pinctrl: rockchip: fix leaked of_node references tty: max310x: Fix invalid baudrate divisors calculator usb: core: hub: Disable hub-initiated U1/U2 staging: vt6656: use meaningful error code during buffer allocation iio: adc: stm32-dfsdm: missing error case during probe iio: adc: stm32-dfsdm: manage the get_irq error case drm/panel: simple: Fix panel_simple_dsi_probe hvsock: fix epollout hang from race condition Change-Id: I4c37256db5ec08367a22a1c50bb97db267c822da Signed-off-by: Ivaylo Georgiev <irgeorgiev@codeaurora.org>	2019-08-13 01:20:38 -07:00
Ira Weiny	30edc7c1fe	mm/swap: fix release_pages() when releasing devmap pages [ Upstream commit c5d6c45e90c49150670346967971e14576afd7f1 ] release_pages() is an optimized version of a loop around put_page(). Unfortunately for devmap pages the logic is not entirely correct in release_pages(). This is because device pages can be more than type MEMORY_DEVICE_PUBLIC. There are in fact 4 types, private, public, FS DAX, and PCI P2PDMA. Some of these have specific needs to "put" the page while others do not. This logic to handle any special needs is contained in put_devmap_managed_page(). Therefore all devmap pages should be processed by this function where we can contain the correct logic for a page put. Handle all device type pages within release_pages() by calling put_devmap_managed_page() on all devmap pages. If put_devmap_managed_page() returns true the page has been put and we continue with the next page. A false return of put_devmap_managed_page() means the page did not require special processing and should fall to "normal" processing. This was found via code inspection while determining if release_pages() and the new put_user_pages() could be interchangeable.[1] [1] https://lkml.kernel.org/r/20190523172852.GA27175@iweiny-DESK2.sc.intel.com Link: https://lkml.kernel.org/r/20190605214922.17684-1-ira.weiny@intel.com Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-31 07:27:03 +02:00
qctecmdr	1e525b68fe	Merge "Merge android-4.19.31 (`bb418a1`) into msm-4.19"	2019-05-01 10:57:19 -07:00
Laurent Dufour	808c47e1d5	mm: introduce __lru_cache_add_active_or_unevictable The speculative page fault handler which is run without holding the mmap_sem is calling lru_cache_add_active_or_unevictable() but the vm_flags is not guaranteed to remain constant. Introducing __lru_cache_add_active_or_unevictable() which has the vma flags value parameter instead of the vma pointer. Change-Id: I68decbe0f80847403127c45c97565e47512532e9 Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:20 [vinmenon@codeaurora.org: trivial merge conflict fixes] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> [charante@codeaurora.org: trivial merge conflict fixes] Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>	2019-03-29 03:09:25 -07:00
Michal Hocko	e6e9d6e290	mm: handle lru_add_drain_all for UP properly [ Upstream commit 6ea183d60c469560e7b08a83c9804299e84ec9eb ] Since for_each_cpu(cpu, mask) added by commit `2d3854a37e` ("cpumask: introduce new API, without changing anything") did not evaluate the mask argument if NR_CPUS == 1 due to CONFIG_SMP=n, lru_add_drain_all() is hitting WARN_ON() at __flush_work() added by commit 4d43d395fed12463 ("workqueue: Try to catch flush_work() without INIT_WORK().") by unconditionally calling flush_work() [1]. Workaround this issue by using CONFIG_SMP=n specific lru_add_drain_all implementation. There is no real need to defer the implementation to the workqueue as the draining is going to happen on the local cpu. So alias lru_add_drain_all to lru_add_drain which does all the necessary work. [akpm@linux-foundation.org: fix various build warnings] [1] https://lkml.kernel.org/r/18a30387-6aa5-6123-e67c-57579ecc3f38@roeck-us.net Link: http://lkml.kernel.org/r/20190213124334.GH4525@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-03-23 20:09:50 +01:00
Dan Williams	e763848843	mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS In preparation for fixing dax-dma-vs-unmap issues, filesystems need to be able to rely on the fact that they will get wakeups on dev_pagemap page-idle events. Introduce MEMORY_DEVICE_FS_DAX and generic_dax_page_free() as common indicator / infrastructure for dax filesytems to require. With this change there are no users of the MEMORY_DEVICE_HOST designation, so remove it. The HMM sub-system extended dev_pagemap to arrange a callback when a dev_pagemap managed page is freed. Since a dev_pagemap page is free / idle when its reference count is 1 it requires an additional branch to check the page-type at put_page() time. Given put_page() is a hot-path we do not want to incur that check if HMM is not in use, so a static branch is used to avoid that overhead when not necessary. Now, the FS_DAX implementation wants to reuse this mechanism for receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific static-key into a generic mechanism that either HMM or FS_DAX code paths can enable. For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support, care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure. However, we still need to support FS_DAX in the FS_DAX_LIMITED case implemented by the s390/dcssblk driver. Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Thomas Meyer <thomas@m3y3r.de> Reported-by: Dave Jiang <dave.jiang@intel.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-05-22 06:59:39 -07:00
Mike Rapoport	002843de36	mm/swap.c: remove @cold parameter description for release_pages() The 'cold' parameter was removed from release_pages function by commit `c6f92f9fbe` ("mm: remove cold parameter for release_pages"). Update the description to match the code. Link: http://lkml.kernel.org/r/1519585191-10180-3-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-04-05 21:36:26 -07:00
Mike Rapoport	cb6f0f3480	mm/swap.c: make functions and their kernel-doc agree (again) There was a conflict between the commit `e02a9f048e` ("mm/swap.c: make functions and their kernel-doc agree") and the commit `f144c390f9` ("mm: docs: fix parameter names mismatch") that both tried to fix mismatch betweeen pagevec_lookup_entries() parameter names and their description. Since nr_entries is a better name for the parameter, fix the description again. Link: http://lkml.kernel.org/r/1518116946-20947-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-02-21 15:35:43 -08:00
Shakeel Butt	9c4e6b1a70	mm, mlock, vmscan: no more skipping pagevecs When a thread mlocks an address space backed either by file pages which are currently not present in memory or swapped out anon pages (not in swapcache), a new page is allocated and added to the local pagevec (lru_add_pvec), I/O is triggered and the thread then sleeps on the page. On I/O completion, the thread can wake on a different CPU, the mlock syscall will then sets the PageMlocked() bit of the page but will not be able to put that page in unevictable LRU as the page is on the pagevec of a different CPU. Even on drain, that page will go to evictable LRU because the PageMlocked() bit is not checked on pagevec drain. The page will eventually go to right LRU on reclaim but the LRU stats will remain skewed for a long time. This patch puts all the pages, even unevictable, to the pagevecs and on the drain, the pages will be added on their LRUs correctly by checking their evictability. This resolves the mlocked pages on pagevec of other CPUs issue because when those pagevecs will be drained, the mlocked file pages will go to unevictable LRU. Also this makes the race with munlock easier to resolve because the pagevec drains happen in LRU lock. However there is still one place which makes a page evictable and does PageLRU check on that page without LRU lock and needs special attention. TestClearPageMlocked() and isolate_lru_page() in clear_page_mlock(). #0: __pagevec_lru_add_fn #1: clear_page_mlock SetPageLRU() if (!TestClearPageMlocked()) return smp_mb() // <--required // inside does PageLRU if (!PageMlocked()) if (isolate_lru_page()) move to evictable LRU putback_lru_page() else move to unevictable LRU In '#1', TestClearPageMlocked() provides full memory barrier semantics and thus the PageLRU check (inside isolate_lru_page) can not be reordered before it. In '#0', without explicit memory barrier, the PageMlocked() check can be reordered before SetPageLRU(). If that happens, '#0' can put a page in unevictable LRU and '#1' might have just cleared the Mlocked bit of that page but fails to isolate as PageLRU fails as '#0' still hasn't set PageLRU bit of that page. That page will be stranded on the unevictable LRU. There is one (good) side effect though. Without this patch, the pages allocated for System V shared memory segment are added to evictable LRUs even after shmctl(SHM_LOCK) on that segment. This patch will correctly put such pages to unevictable LRU. Link: http://lkml.kernel.org/r/20171121211241.18877-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Jan Kara <jack@suse.cz> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-02-21 15:35:42 -08:00
Mike Rapoport	f144c390f9	mm: docs: fix parameter names mismatch There are several places where parameter descriptions do no match the actual code. Fix it. Link: http://lkml.kernel.org/r/1516700871-22279-3-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-02-06 18:32:48 -08:00
Randy Dunlap	e02a9f048e	mm/swap.c: make functions and their kernel-doc agree Fix some basic kernel-doc notation in mm/swap.c: - for function lru_cache_add_anon(), make its kernel-doc function name match its function name and change colon to hyphen following the function name - for function pagevec_lookup_entries(), change the function parameter name from nr_pages to nr_entries since that is more descriptive of what the parameter actually is and then it matches the kernel-doc comments also Fix function kernel-doc to match the change in commit `67fd707f46`: - drop the kernel-doc notation for @nr_pages from pagevec_lookup_range() and correct the function description for that change Link: http://lkml.kernel.org/r/3b42ee3e-04a9-a6ca-6be4-f00752a114fe@infradead.org Fixes: `67fd707f46` ("mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-01-31 17:18:40 -08:00
Michal Hocko	9852a72123	mm: drop hotplug lock from lru_add_drain_all() Pulling cpu hotplug locks inside the mm core function like lru_add_drain_all just asks for problems and the recent lockdep splat [1] just proves this. While the usage in that particular case might be wrong we should avoid the locking as lru_add_drain_all() is used in many places. It seems that this is not all that hard to achieve actually. We have done the same thing for drain_all_pages which is analogous by commit `a459eeb7b8` ("mm, page_alloc: do not depend on cpu hotplug locks inside the allocator"). All we have to care about is to handle - the work item might be executed on a different cpu in worker from unbound pool so it doesn't run on pinned on the cpu - we have to make sure that we do not race with page_alloc_cpu_dead calling lru_add_drain_cpu the first part is already handled because the worker calls lru_add_drain which disables preemption when calling lru_add_drain_cpu on the local cpu it is draining. The later is true because page_alloc_cpu_dead is called on the controlling CPU after the hotplugged CPU vanished completely. [1] http://lkml.kernel.org/r/089e0825eec8955c1f055c83d476@google.com [add a cpu hotplug locking interaction as per tglx] Link: http://lkml.kernel.org/r/20171116120535.23765-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-01-31 17:18:36 -08:00
Mel Gorman	7f0b5fb953	mm, pagevec: rename pagevec drained field According to Vlastimil Babka, the drained field in pagevec is potentially misleading because it might be interpreted as draining this pagevec instead of the percpu lru pagevecs. Rename the field for clarity. Link: http://lkml.kernel.org/r/20171019093346.ylahzdpzmoriyf4v@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:06 -08:00
Mel Gorman	2d4894b5d2	mm: remove cold parameter from free_hot_cold_page* Most callers users of free_hot_cold_page claim the pages being released are cache hot. The exception is the page reclaim paths where it is likely that enough pages will be freed in the near future that the per-cpu lists are going to be recycled and the cache hotness information is lost. As no one really cares about the hotness of pages being released to the allocator, just ditch the parameter. The APIs are renamed to indicate that it's no longer about hot/cold pages. It should also be less confusing as there are subtle differences between them. __free_pages drops a reference and frees a page when the refcount reaches zero. free_hot_cold_page handled pages whose refcount was already zero which is non-obvious from the name. free_unref_page should be more obvious. No performance impact is expected as the overhead is marginal. The parameter is removed simply because it is a bit stupid to have a useless parameter copied everywhere. [mgorman@techsingularity.net: add pages to head, not tail] Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:06 -08:00
Mel Gorman	c6f92f9fbe	mm: remove cold parameter for release_pages All callers of release_pages claim the pages being released are cache hot. As no one cares about the hotness of pages being released to the allocator, just ditch the parameter. No performance impact is expected as the overhead is marginal. The parameter is removed simply because it is a bit stupid to have a useless parameter copied everywhere. Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:06 -08:00
Mel Gorman	8667982014	mm, pagevec: remove cold parameter for pagevecs Every pagevec_init user claims the pages being released are hot even in cases where it is unlikely the pages are hot. As no one cares about the hotness of pages being released to the allocator, just ditch the parameter. No performance impact is expected as the overhead is marginal. The parameter is removed simply because it is a bit stupid to have a useless parameter copied everywhere. Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:06 -08:00
Mel Gorman	d9ed0d08b6	mm: only drain per-cpu pagevecs once per pagevec usage When a pagevec is initialised on the stack, it is generally used multiple times over a range of pages, looking up entries and then releasing them. On each pagevec_release, the per-cpu deferred LRU pagevecs are drained on the grounds the page being released may be on those queues and the pages may be cache hot. In many cases only the first drain is necessary as it's unlikely that the range of pages being walked is racing against LRU addition. Even if there is such a race, the impact is marginal where as constantly redraining the lru pagevecs costs. This patch ensures that pagevec is only drained once in a given lifecycle without increasing the cache footprint of the pagevec structure. Only sparsetruncate tiny is shown here as large files have many exceptional entries and calls pagecache_release less frequently. sparsetruncate (tiny) 4.14.0-rc4 4.14.0-rc4 batchshadow-v1r1 onedrain-v1r1 Min Time 141.00 ( 0.00%) 141.00 ( 0.00%) 1st-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%) 2nd-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%) 3rd-qrtle Time 143.00 ( 0.00%) 143.00 ( 0.00%) Max-90% Time 144.00 ( 0.00%) 144.00 ( 0.00%) Max-95% Time 146.00 ( 0.00%) 145.00 ( 0.68%) Max-99% Time 198.00 ( 0.00%) 194.00 ( 2.02%) Max Time 254.00 ( 0.00%) 208.00 ( 18.11%) Amean Time 145.12 ( 0.00%) 144.30 ( 0.56%) Stddev Time 12.74 ( 0.00%) 9.62 ( 24.49%) Coeff Time 8.78 ( 0.00%) 6.67 ( 24.06%) Best99%Amean Time 144.29 ( 0.00%) 143.82 ( 0.32%) Best95%Amean Time 142.68 ( 0.00%) 142.31 ( 0.26%) Best90%Amean Time 142.52 ( 0.00%) 142.19 ( 0.24%) Best75%Amean Time 142.26 ( 0.00%) 141.98 ( 0.20%) Best50%Amean Time 141.90 ( 0.00%) 141.71 ( 0.13%) Best25%Amean Time 141.80 ( 0.00%) 141.43 ( 0.26%) The impact on bonnie is marginal and within the noise because a significant percentage of the file being truncated has been reclaimed and consists of shadow entries which reduce the hotness of the pagevec_release path. Link: http://lkml.kernel.org/r/20171018075952.10627-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:06 -08:00
Jan Kara	67fd707f46	mm: remove nr_pages argument from pagevec_lookup_{,range}_tag() All users of pagevec_lookup() and pagevec_lookup_range() now pass PAGEVEC_SIZE as a desired number of pages. Just drop the argument. Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:04 -08:00
Jan Kara	93d3b7140a	mm: add variant of pagevec_lookup_range_tag() taking number of pages Currently pagevec_lookup_range_tag() takes number of pages to look up but most users don't need this. Create a new function pagevec_lookup_range_nr_tag() that takes maximum number of pages to lookup for Ceph which wants this functionality so that we can drop nr_pages argument from pagevec_lookup_range_tag(). Link: http://lkml.kernel.org/r/20171009151359.31984-13-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:04 -08:00
Jan Kara	72b045aecd	mm: implement find_get_pages_range_tag() Patch series "Ranged pagevec tagged lookup", v3. In this series I provide a ranged variant of pagevec_lookup_tag() and use it in places where it makes sense. This series removes some common code and it also has a potential for speeding up some operations similarly as for pagevec_lookup_range() (but for now I can think of only artificial cases where this happens). This patch (of 16): Implement a variant of find_get_pages_tag() that stops iterating at given index. Lots of users of this function (through pagevec_lookup()) actually want a range lookup and all of them are currently open-coding this. Also create corresponding pagevec_lookup_range_tag() function. Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: David Howells <dhowells@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Steve French <sfrench@samba.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:03 -08:00
Shaohua Li	24c92eb7dc	mm: avoid marking swap cached page as lazyfree MADV_FREE clears pte dirty bit and then marks the page lazyfree (clear SwapBacked). There is no lock to prevent the page is added to swap cache between these two steps by page reclaim. Page reclaim could add the page to swap cache and unmap the page. After page reclaim, the page is added back to lru. At that time, we probably start draining per-cpu pagevec and mark the page lazyfree. So the page could be in a state with SwapBacked cleared and PG_swapcache set. Next time there is a refault in the virtual address, do_swap_page can find the page from swap cache but the page has PageSwapCache false because SwapBacked isn't set, so do_swap_page will bail out and do nothing. The task will keep running into fault handler. Fixes: `802a3a92ad` ("mm: reclaim MADV_FREE pages") Link: http://lkml.kernel.org/r/6537ef3814398c0073630b03f176263bc81f0902.1506446061.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Reported-by: Artem Savkov <asavkov@redhat.com> Tested-by: Artem Savkov <asavkov@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> [4.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-10-03 17:54:24 -07:00
Jérôme Glisse	df6ad69838	mm/device-public-memory: device memory cache coherent with CPU Platform with advance system bus (like CAPI or CCIX) allow device memory to be accessible from CPU in a cache coherent fashion. Add a new type of ZONE_DEVICE to represent such memory. The use case are the same as for the un-addressable device memory but without all the corners cases. Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Nellans <dnellans@nvidia.com> Cc: Evgeny Baskakov <ebaskakov@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mark Hairgrove <mhairgrove@nvidia.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sherry Cheung <SCheung@nvidia.com> Cc: Subhash Gutti <sgutti@nvidia.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Bob Liu <liubo95@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-08 18:26:46 -07:00
Jan Kara	397162ffa2	mm: remove nr_pages argument from pagevec_lookup{,_range}() All users of pagevec_lookup() and pagevec_lookup_range() now pass PAGEVEC_SIZE as a desired number of pages. Just drop the argument. Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:27 -07:00
Jan Kara	b947cee4b9	mm: implement find_get_pages_range() Implement a variant of find_get_pages() that stops iterating at given index. This may be substantial performance gain if the mapping is sparse. See following commit for details. Furthermore lots of users of this function (through pagevec_lookup()) actually want a range lookup and all of them are currently open-coding this. Also create corresponding pagevec_lookup_range() function. Link: http://lkml.kernel.org/r/20170726114704.7626-4-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:26 -07:00
Jan Kara	d72dc8a25a	mm: make pagevec_lookup() update index Make pagevec_lookup() (and underlying find_get_pages()) update index to the next page where iteration should continue. Most callers want this and also pagevec_lookup_tag() already does this. Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:26 -07:00
Thomas Gleixner	a47fed5b5b	mm: swap: provide lru_add_drain_all_cpuslocked() The rework of the cpu hotplug locking unearthed potential deadlocks with the memory hotplug locking code. The solution for these is to rework the memory hotplug locking code as well and take the cpu hotplug lock before the memory hotplug lock in mem_hotplug_begin(), but this will cause a recursive locking of the cpu hotplug lock when the memory hotplug code calls lru_add_drain_all(). Split out the inner workings of lru_add_drain_all() into lru_add_drain_all_cpuslocked() so this function can be invoked from the memory hotplug code with the cpu hotplug lock held. Link: http://lkml.kernel.org/r/20170704093421.419329357@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-10 16:32:33 -07:00
Roman Gushchin	2262185c5b	mm: per-cgroup memory reclaim stats Track the following reclaim counters for every memory cgroup: PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED. These values are exposed using the memory.stats interface of cgroup v2. The meaning of each value is the same as for global counters, available using /proc/vmstat. Also, for consistency, rename mem_cgroup_count_vm_event() to count_memcg_event_mm(). Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:35 -07:00
Shaohua Li	f7ad2a6cb9	mm: move MADV_FREE pages into LRU_INACTIVE_FILE list madv()'s MADV_FREE indicate pages are 'lazyfree'. They are still anonymous pages, but they can be freed without pageout. To distinguish these from normal anonymous pages, we clear their SwapBacked flag. MADV_FREE pages could be freed without pageout, so they pretty much like used once file pages. For such pages, we'd like to reclaim them once there is memory pressure. Also it might be unfair reclaiming MADV_FREE pages always before used once file pages and we definitively want to reclaim the pages before other anonymous and file pages. To speed up MADV_FREE pages reclaim, we put the pages into LRU_INACTIVE_FILE list. The rationale is LRU_INACTIVE_FILE list is tiny nowadays and should be full of used once file pages. Reclaiming MADV_FREE pages will not have much interfere of anonymous and active file pages. And the inactive file pages and MADV_FREE pages will be reclaimed according to their age, so we don't reclaim too many MADV_FREE pages too. Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also means we can reclaim the pages without swap support. This idea is suggested by Johannes. This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to avoid bisect failure, next patch will do it. The patch is based on Minchan's original patch. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:08 -07:00
Linus Torvalds	d3b5d35290	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "The main x86 MM changes in this cycle were: - continued native kernel PCID support preparation patches to the TLB flushing code (Andy Lutomirski) - various fixes related to 32-bit compat syscall returning address over 4Gb in applications, launched from 64-bit binaries - motivated by C/R frameworks such as Virtuozzo. (Dmitry Safonov) - continued Intel 5-level paging enablement: in particular the conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov) - x86/mpx ABI corner case fixes/enhancements (Joerg Roedel) - ... plus misc updates, fixes and cleanups" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits) mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash x86/mm: Fix flush_tlb_page() on Xen x86/mm: Make flush_tlb_mm_range() more predictable x86/mm: Remove flush_tlb() and flush_tlb_current_task() x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly() x86/mm/64: Fix crash in remove_pagetable() Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" x86/boot/e820: Remove a redundant self assignment x86/mm: Fix dump pagetables for 4 levels of page tables x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()" x86/espfix: Add support for 5-level paging x86/kasan: Extend KASAN to support 5-level paging x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y x86/paravirt: Add 5-level support to the paravirt code x86/mm: Define virtual memory map for 5-level paging x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert x86/boot: Detect 5-level paging support x86/mm/numa: Remove numa_nodemask_from_meminfo() ...	2017-05-01 23:54:56 -07:00
Dan Williams	7138970383	mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash The x86 conversion to the generic GUP code included a small change which causes crashes and data corruption in the pmem code - not good. The root cause is that the /dev/pmem driver code implicitly relies on the x86 get_user_pages() implementation doing a get_page() on the page refcount, because get_page() does a get_zone_device_page() which properly refcounts pmem's separate page struct arrays that are not present in the regular page struct structures. (The pmem driver does this because it can cover huge memory areas.) But the x86 conversion to the generic GUP code changed the get_page() to page_cache_get_speculative() which is faster but doesn't do the get_zone_device_page() call the pmem code relies on. One way to solve the regression would be to change the generic GUP code to use get_page(), but that would slow things down a bit and punish other generic-GUP using architectures for an x86-ism they did not care about. (Arguably the pmem driver was probably not working reliably for them: but nvdimm is an Intel feature, so non-x86 exposure is probably still limited.) So restructure the pmem code's interface with the MM instead: get rid of the get/put_zone_device_page() distinction, integrate put_zone_device_page() into __put_page() and and restructure the pmem completion-wait and teardown machinery: Kirill points out that the calls to {get,put}_dev_pagemap() can be removed from the mm fast path if we take a single get_dev_pagemap() reference to signify that the page is alive and use the final put of the page to drop that reference. This does require some care to make sure that any waits for the percpu_ref to drop to zero occur after devm_memremap_page_release(), since it now maintains its own elevated reference. This speeds up things while also making the pmem refcounting more robust going forward. Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-01 09:15:53 +02:00
Michal Hocko	ce612879dd	mm: move pcp and lru-pcp draining into single wq We currently have 2 specific WQ_RECLAIM workqueues in the mm code. vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain per cpu lru caches. This seems more than necessary because both can run on a single WQ. Both do not block on locks requiring a memory allocation nor perform any allocations themselves. We will save one rescuer thread this way. On the other hand drain_all_pages() queues work on the system wq which doesn't have rescuer and so this depend on memory allocation (when all workers are stuck allocating and new ones cannot be created). Initially we thought this would be more of a theoretical problem but Hugh Dickins has reported: : 4.11-rc has been giving me hangs after hours of swapping load. At : first they looked like memory leaks ("fork: Cannot allocate memory"); : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh" : before looking at /proc/meminfo one time, and the stat_refresh stuck : in D state, waiting for completion of flush_work like many kworkers. : kthreadd waiting for completion of flush_work in drain_all_pages(). This worker should be using WQ_RECLAIM as well in order to guarantee a forward progress. We can reuse the same one as for lru draining and vmstat. Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Yang Li <pku.leo@gmail.com> Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-04-08 00:47:49 -07:00
Johannes Weiner	c55e8d035b	mm: vmscan: move dirty pages out of the way until they're flushed We noticed a performance regression when moving hadoop workloads from 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout activity initiated by kswapd as well as frequent bursts of allocation stalls and direct reclaim scans. Even lowering the dirty ratios to the equivalent of less than 1% of memory would not eliminate the issue, suggesting that dirty pages concentrate where the scanner is looking. This can be traced back to recent efforts of thrash avoidance. Where 3.10 would not detect refaulting pages and continuously supply clean cache to the inactive list, a thrashing workload on 4.0+ will detect and activate refaulting pages right away, distilling used-once pages on the inactive list much more effectively. This is by design, and it makes sense for clean cache. But for the most part our workload's cache faults are refaults and its use-once cache is from streaming writes. We end up with most of the inactive list dirty, and we don't go after the active cache as long as we have use-once pages around. But waiting for writes to avoid reclaiming clean cache that might refault is a bad trade-off. Even if the refaults happen, reads are faster than writes. Before getting bogged down on writeback, reclaim should first look at all cache in the system, even active cache. To accomplish this, activate pages that are dirty or under writeback when they reach the end of the inactive LRU. The pages are marked for immediate reclaim, meaning they'll get moved back to the inactive LRU tail as soon as they're written back and become reclaimable. But in the meantime, by reducing the inactive list to only immediately reclaimable pages, we allow the scanner to deactivate and refill the inactive list with clean cache from the active list tail to guarantee forward progress. [hannes@cmpxchg.org: update comment] Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-24 17:46:54 -08:00
Huang, Ying	4b3ef9daa4	mm/swap: split swap cache into 64MB trunks The patch is to improve the scalability of the swap out/in via using fine grained locks for the swap cache. In current kernel, one address space will be used for each swap device. And in the common configuration, the number of the swap device is very small (one is typical). This causes the heavy lock contention on the radix tree of the address space if multiple tasks swap out/in concurrently. But in fact, there is no dependency between pages in the swap cache. So that, we can split the one shared address space for each swap device into several address spaces to reduce the lock contention. In the patch, the shared address space is split into 64MB trunks. 64MB is chosen to balance the memory space usage and effect of lock contention reduction. The size of struct address_space on x86_64 architecture is 408B, so with the patch, 6528B more memory will be used for every 1GB swap space on x86_64 architecture. One address space is still shared for the swap entries in the same 64M trunks. To avoid lock contention for the first round of swap space allocation, the order of the swap clusters in the initial free clusters list is changed. The swap space distance between the consecutive swap clusters in the free cluster list is at least 64M. After the first round of allocation, the swap clusters are expected to be freed randomly, so the lock contention should be reduced effectively. Link: http://lkml.kernel.org/r/735bab895e64c930581ffb0a05b661e01da82bc5.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:30 -08:00
Nicholas Piggin	6290602709	mm: add PageWaiters indicating tasks are waiting for a page bit Add a new page flag, PageWaiters, to indicate the page waitqueue has tasks waiting. This can be tested rather than testing waitqueue_active which requires another cacheline load. This bit is always set when the page has tasks on page_waitqueue(page), and is set and cleared under the waitqueue lock. It may be set when there are no tasks on the waitqueue, which will cause a harmless extra wakeup check that will clears the bit. The generic bit-waitqueue infrastructure is no longer used for pages. Instead, waitqueues are used directly with a custom key type. The generic code was not flexible enough to have PageWaiters manipulation under the waitqueue lock (which simplifies concurrency). This improves the performance of page lock intensive microbenchmarks by 2-3%. Putting two bits in the same word opens the opportunity to remove the memory barrier between clearing the lock bit and testing the waiters bit, after some work on the arch primitives (e.g., ensuring memory operand widths match and cover both bits). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Lutomirski <luto@kernel.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-12-25 11:54:48 -08:00
Aaron Lu	6fcb52a56f	thp: reduce usage of huge zero page's atomic counter The global zero page is used to satisfy an anonymous read fault. If THP(Transparent HugePage) is enabled then the global huge zero page is used. The global huge zero page uses an atomic counter for reference counting and is allocated/freed dynamically according to its counter value. CPU time spent on that counter will greatly increase if there are a lot of processes doing anonymous read faults. This patch proposes a way to reduce the access to the global counter so that the CPU load can be reduced accordingly. To do this, a new flag of the mm_struct is introduced: MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only need to touch the global counter in two cases: 1 The first time it uses the global huge zero page; 2 The time when mm_user of its mm_struct reaches zero. Note that right now, the huge zero page is eligible to be freed as soon as its last use goes away. With this patch, the page will not be eligible to be freed until the exit of the last process from which it was ever used. And with the use of mm_user, the kthread is not eligible to use huge zero page either. Since no kthread is using huge zero page today, there is no difference after applying this patch. But if that is not desired, I can change it to when mm_count reaches zero. Case used for test on Haswell EP: usemem -n 72 --readonly -j 0x200000 100G Which spawns 72 processes and each will mmap 100G anonymous space and then do read only access to that space sequentially with a step of 2MB. CPU cycles from perf report for base commit: 54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page CPU cycles from perf report for this commit: 0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page Performance(throughput) of the workload for base commit: 1784430792 Performance(throughput) of the workload for this commit: 4726928591 164% increase. Runtime of the workload for base commit: 707592 us Runtime of the workload for this commit: 303970 us 50% drop. Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:28 -07:00
Mel Gorman	68eb0731c4	mm, pagevec: release/reacquire lru_lock on pgdat change With node-lru, the locking is based on the pgdat. Previously it was required that a pagevec drain released one zone lru_lock and acquired another zone lru_lock on every zone change. Now, it's only necessary if the node changes. The end-result is fewer lock release/acquires if the pages are all on the same node but in different zones. Link: http://lkml.kernel.org/r/1468588165-12461-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Mel Gorman	599d0c954f	mm, vmscan: move LRU lists to node This moves the LRU lists from the zone to the node and related data such as counters, tracing, congestion tracking and writeback tracking. Unfortunately, due to reclaim and compaction retry logic, it is necessary to account for the number of LRU pages on both zone and node logic. Most reclaim logic is based on the node counters but the retry logic uses the zone counters which do not distinguish inactive and active sizes. It would be possible to leave the LRU counters on a per-zone basis but it's a heavier calculation across multiple cache lines that is much more frequent than the retry checks. Other than the LRU counters, this is mostly a mechanical patch but note that it introduces a number of anomalies. For example, the scans are per-zone but using per-node counters. We also mark a node as congested when a zone is congested. This causes weird problems that are fixed later but is easier to review. In the event that there is excessive overhead on 32-bit systems due to the nodes being on LRU then there are two potential solutions 1. Long-term isolation of highmem pages when reclaim is lowmem When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. It potentially could be implemented similar to the UNEVICTABLE list. That would reduce the skip rate with the potential corner case is that highmem pages have to be scanned and reclaimed to free lowmem slab pages. 2. Linear scan lowmem pages if the initial LRU shrink fails This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages. Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Mel Gorman	a52633d8e9	mm, vmscan: move lru_lock to the node Node-based reclaim requires node-based LRUs and locking. This is a preparation patch that just moves the lru_lock to the node so later patches are easier to review. It is a mechanical change but note this patch makes contention worse because the LRU lock is hotter and direct reclaim and kswapd can contend on the same lock even when reclaiming from different zones. Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Kirill A. Shutemov	800d8c63b2	shmem: add huge pages support Here's basic implementation of huge pages support for shmem/tmpfs. It's all pretty streight-forward: - shmem_getpage() allcoates huge page if it can and try to inserd into radix tree with shmem_add_to_page_cache(); - shmem_add_to_page_cache() puts the page onto radix-tree if there's space for it; - shmem_undo_range() removes huge pages, if it fully within range. Partial truncate of huge pages zero out this part of THP. This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour. As we don't really create hole in this case, lseek(SEEK_HOLE) may have inconsistent results depending what pages happened to be allocated. - no need to change shmem_fault: core-mm will map an compound page as huge if VMA is suitable; Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-26 16:19:19 -07:00
Lukasz Odzioba	8f182270df	mm/swap.c: flush lru pvecs on compound page arrival Currently we can have compound pages held on per cpu pagevecs, which leads to a lot of memory unavailable for reclaim when needed. In the systems with hundreads of processors it can be GBs of memory. On of the way of reproducing the problem is to not call munmap explicitly on all mapped regions (i.e. after receiving SIGTERM). After that some pages (with THP enabled also huge pages) may end up on lru_add_pvec, example below. void main() { #pragma omp parallel { size_t size = 55 * 1000 * 1000; // smaller than MEM/CPUS void *p = mmap(NULL, size, PROT_READ \| PROT_WRITE, MAP_PRIVATE \| MAP_ANONYMOUS , -1, 0); if (p != MAP_FAILED) memset(p, 0, size); //munmap(p, size); // uncomment to make the problem go away } } When we run it with THP enabled it will leave significant amount of memory on lru_add_pvec. This memory will be not reclaimed if we hit OOM, so when we run above program in a loop: for i in `seq 100`; do ./a.out; done many processes (95% in my case) will be killed by OOM. The primary point of the LRU add cache is to save the zone lru_lock contention with a hope that more pages will belong to the same zone and so their addition can be batched. The huge page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like a safer option when compared to a potential excess in the caching which can be quite large and much harder to fix because lru_add_drain_all is way to expensive and it is not really clear what would be a good moment to call it. Similarly we can reproduce the problem on lru_deactivate_pvec by adding: madvise(p, size, MADV_FREE); after memset. This patch flushes lru pvecs on compound page arrival making the problem less severe - after applying it kill rate of above example drops to 0%, due to reducing maximum amount of memory held on pvec from 28MB (with THP) to 56kB per CPU. Suggested-by: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.com Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Ming Li <mingli199x@qq.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-06-24 17:23:52 -07:00
Wang Sheng-Hui	f3a932baa7	mm: introduce dedicated WQ_MEM_RECLAIM workqueue to do lru_add_drain_all This patch is based on https://patchwork.ozlabs.org/patch/574623/. Tejun submitted commit `23d11a58a9` ("workqueue: skip flush dependency checks for legacy workqueues") for the legacy create*_workqueue() interface. But some workq created by alloc_workqueue still reports warning on memory reclaim, e.g nvme_workq with flag WQ_MEM_RECLAIM set: workqueue: WQ_MEM_RECLAIM nvme:nvme_reset_work is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu ------------[ cut here ]------------ WARNING: CPU: 0 PID: 6 at SoC/linux/kernel/workqueue.c:2448 check_flush_dependency+0xb4/0x10c ... check_flush_dependency+0xb4/0x10c flush_work+0x54/0x140 lru_add_drain_all+0x138/0x188 migrate_prep+0xc/0x18 alloc_contig_range+0xf4/0x350 cma_alloc+0xec/0x1e4 dma_alloc_from_contiguous+0x38/0x40 __dma_alloc+0x74/0x25c nvme_alloc_queue+0xcc/0x36c nvme_reset_work+0x5c4/0xda8 process_one_work+0x128/0x2ec worker_thread+0x58/0x434 kthread+0xd4/0xe8 ret_from_fork+0x10/0x50 That's because lru_add_drain_all() will schedule the drain work on system_wq, whose flag is set to 0, !WQ_MEM_RECLAIM. Introduce a dedicated WQ_MEM_RECLAIM workqueue to do lru_add_drain_all(), aiding in getting memory freed. Link: http://lkml.kernel.org/r/1464917521-9775-1-git-send-email-shhuiw@foxmail.com Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thierry Reding <treding@nvidia.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-06-09 14:23:11 -07:00
Ming Li	a4a921aa5c	mm/swap.c: put activate_page_pvecs and other pagevecs together Put the activate_page_pvecs definition next to those of the other pagevecs, for clarity. Signed-off-by: Ming Li <mingli199x@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-05-20 17:58:30 -07:00
Kirill A. Shutemov	aa88b68c3b	thp: keep huge zero page pinned until tlb flush Andrea has found[1] a race condition on MMU-gather based TLB flush vs split_huge_page() or shrinker which frees huge zero under us (patch 1/2 and 2/2 respectively). With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the page pinned until flush is complete and the pin prevents the page from being split under us. We still need patch 2/2. This is simplified version of Andrea's patch. We don't need fancy encoding. [1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-04-28 19:34:04 -07:00
Kirill A. Shutemov	ea1754a084	mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage Mostly direct substitution with occasional adjustment or removing outdated comments. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-04-04 10:41:08 -07:00
Kirill A. Shutemov	09cbfeaf1a	mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced long time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-04-04 10:41:08 -07:00
Dan Williams	3565fce3a6	mm, x86: get_user_pages() for dax mappings A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver has established a devm_memremap_pages() mapping, i.e. when the pfn_t return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when encountering _PAGE_DEVMAP during a page table walk we lookup and pin a struct dev_pagemap instance to keep the result of pfn_to_page() valid until put_page(). Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Logan Gunthorpe <logang@deltatee.com> Cc: Dave Hansen <dave@sr71.net> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00
Minchan Kim	10853a0392	mm: move lazily freed pages to inactive list MADV_FREE is a hint that it's okay to discard pages if there is memory pressure and we use reclaimers(ie, kswapd and direct reclaim) to free them so there is no value keeping them in the active anonymous LRU so this patch moves them to inactive LRU list's head. This means that MADV_FREE-ed pages which were living on the inactive list are reclaimed first because they are more likely to be cold rather than recently active pages. An arguable issue for the approach would be whether we should put the page to the head or tail of the inactive list. I chose head because the kernel cannot make sure it's really cold or warm for every MADV_FREE usecase but at least we know it's not hot, so landing of inactive head would be a comprimise for various usecases. This fixes suboptimal behavior of MADV_FREE when pages living on the active list will sit there for a long time even under memory pressure while the inactive list is reclaimed heavily. This basically breaks the whole purpose of using MADV_FREE to help the system to free memory which is might not be used. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: <yalin.wang2010@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jason Evans <je@fb.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Mika Penttil <mika.penttila@nextfour.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Roland Dreier <roland@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00
Kirill A. Shutemov	e90309c9f7	thp: allow mlocked THP again Before THP refcounting rework, THP was not allowed to cross VMA boundary. So, if we have THP and we split it, PG_mlocked can be safely transferred to small pages. With new THP refcounting and naive approach to mlocking we can end up with this scenario: 1. we have a mlocked THP, which belong to one VM_LOCKED VMA. 2. the process does munlock() on the part of the THP: - the VMA is split into two, one of them VM_LOCKED; - huge PMD split into PTE table; - THP is still mlocked; 3. split_huge_page(): - it transfers PG_mlocked to all small pages regrardless if it blong to any VM_LOCKED VMA. We probably could munlock() all small pages on split_huge_page(), but I think we have accounting issue already on step two. Instead of forbidding mlocked pages altogether, we just avoid mlocking PTE-mapped THPs and munlock THPs on split_huge_pmd(). This means PTE-mapped THPs will be on normal lru lists and will be split under memory pressure by vmscan. After the split vmscan will detect unevictable small pages and mlock them. With this approach we shouldn't hit situation like described above. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00

1 2 3 4

200 Commits