Commit Graph

200 Commits

Author SHA1 Message Date
Michael Bestas
1b59618ce4 Merge tag 'ASB-2023-09-05_4.19-stable' of https://android.googlesource.com/kernel/common into android13-4.19-kona
https://source.android.com/docs/security/bulletin/2023-09-01

* tag 'ASB-2023-09-05_4.19-stable' of https://android.googlesource.com/kernel/common:
  Linux 4.19.294
  Revert "ARM: ep93xx: fix missing-prototype warnings"
  Revert "MIPS: Alchemy: fix dbdma2"
  Linux 4.19.293
  dma-buf/sw_sync: Avoid recursive lock during fence signal
  clk: Fix undefined reference to `clk_rate_exclusive_{get,put}'
  scsi: core: raid_class: Remove raid_component_add()
  scsi: snic: Fix double free in snic_tgt_create()
  irqchip/mips-gic: Don't touch vl_map if a local interrupt is not routable
  rtnetlink: Reject negative ifindexes in RTM_NEWLINK
  netfilter: nf_queue: fix socket leak
  sched/rt: pick_next_rt_entity(): check list_entry
  mmc: block: Fix in_flight[issue_type] value error
  x86/fpu: Set X86_FEATURE_OSXSAVE feature after enabling OSXSAVE in CR4
  PCI: acpiphp: Use pci_assign_unassigned_bridge_resources() only for non-root bus
  media: vcodec: Fix potential array out-of-bounds in encoder queue_setup
  lib/clz_ctz.c: Fix __clzdi2() and __ctzdi2() for 32-bit kernels
  batman-adv: Fix batadv_v_ogm_aggr_send memory leak
  batman-adv: Fix TT global entry leak when client roamed back
  batman-adv: Do not get eth header before batadv_check_management_packet
  batman-adv: Don't increase MTU when set by user
  batman-adv: Trigger events for auto adjusted MTU
  nfsd: Fix race to FREE_STATEID and cl_revoked
  ibmveth: Use dcbf rather than dcbfl
  ipvs: fix racy memcpy in proc_do_sync_threshold
  ipvs: Improve robustness to the ipvs sysctl
  bonding: fix macvlan over alb bond support
  net: remove bond_slave_has_mac_rcu()
  net/sched: fix a qdisc modification with ambiguous command request
  igb: Avoid starting unnecessary workqueues
  dccp: annotate data-races in dccp_poll()
  sock: annotate data-races around prot->memory_pressure
  tracing: Fix memleak due to race between current_tracer and trace
  drm/amd/display: check TG is non-null before checking if enabled
  drm/amd/display: do not wait for mpc idle if tg is disabled
  regmap: Account for register length in SMBus I/O limits
  dm integrity: reduce vmalloc space footprint on 32-bit architectures
  dm integrity: increase RECALC_SECTORS to improve recalculate speed
  powerpc: Fail build if using recordmcount with binutils v2.37
  powerpc: remove leftover code of old GCC version checks
  powerpc/32: add stack protector support
  fbdev: fix potential OOB read in fast_imageblit()
  fbdev: Fix sys_imageblit() for arbitrary image widths
  fbdev: Improve performance of sys_imageblit()
  tty: serial: fsl_lpuart: add earlycon for imx8ulp platform
  Revert "tty: serial: fsl_lpuart: drop earlycon entry for i.MX8QXP"
  MIPS: cpu-features: Use boot_cpu_type for CPU type based features
  MIPS: cpu-features: Enable octeon_cache by cpu_type
  fs: dlm: fix mismatch of plock results from userspace
  fs: dlm: use dlm_plock_info for do_unlock_close
  fs: dlm: change plock interrupted message to debug again
  fs: dlm: add pid to debug log
  dlm: replace usage of found with dedicated list iterator variable
  dlm: improve plock logging if interrupted
  PCI: acpiphp: Reassign resources on bridge if necessary
  net: phy: broadcom: stub c45 read/write for 54810
  net: xfrm: Amend XFRMA_SEC_CTX nla_policy structure
  net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled
  virtio-net: set queues after driver_ok
  af_unix: Fix null-ptr-deref in unix_stream_sendpage().
  netfilter: set default timeout to 3 secs for sctp shutdown send and recv state
  test_firmware: prevent race conditions by a correct implementation of locking
  mmc: wbsd: fix double mmc_free_host() in wbsd_init()
  cifs: Release folio lock on fscache read hit.
  ALSA: usb-audio: Add support for Mythware XA001AU capture and playback interfaces.
  serial: 8250: Fix oops for port->pm on uart_change_pm()
  ASoC: meson: axg-tdm-formatter: fix channel slot allocation
  ASoC: rt5665: add missed regulator_bulk_disable
  net: do not allow gso_size to be set to GSO_BY_FRAGS
  sock: Fix misuse of sk_under_memory_pressure()
  i40e: fix misleading debug logs
  team: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves
  netfilter: nft_dynset: disallow object maps
  selftests: mirror_gre_changes: Tighten up the TTL test match
  xfrm: add NULL check in xfrm_update_ae_params
  ip_vti: fix potential slab-use-after-free in decode_session6
  ip6_vti: fix slab-use-after-free in decode_session6
  xfrm: fix slab-use-after-free in decode_session6
  xfrm: interface: rename xfrm_interface.c to xfrm_interface_core.c
  net: af_key: fix sadb_x_filter validation
  net: xfrm: Fix xfrm_address_filter OOB read
  btrfs: fix BUG_ON condition in btrfs_cancel_balance
  powerpc/rtas_flash: allow user copy to flash block cache objects
  fbdev: mmp: fix value check in mmphw_probe()
  virtio-mmio: don't break lifecycle of vm_dev
  virtio-mmio: Use to_virtio_mmio_device() to simply code
  virtio-mmio: convert to devm_platform_ioremap_resource
  nfsd: Remove incorrect check in nfsd4_validate_stateid
  nfsd4: kill warnings on testing stateids with mismatched clientids
  block: fix signed int overflow in Amiga partition support
  mmc: sunxi: fix deferred probing
  mmc: bcm2835: fix deferred probing
  mmc: Remove dev_err() usage after platform_get_irq()
  mmc: tmio: move tmio_mmc_set_clock() to platform hook
  mmc: tmio: replace tmio_mmc_clk_stop() calls with tmio_mmc_set_clock()
  mmc: meson-gx: remove redundant mmc_request_done() call from irq context
  mmc: meson-gx: remove useless lock
  USB: dwc3: qcom: fix NULL-deref on suspend
  usb: dwc3: qcom: Add helper functions to enable,disable wake irqs
  irqchip/mips-gic: Use raw spinlock for gic_lock
  irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
  x86/topology: Fix erroneous smp_num_siblings on Intel Hybrid platforms
  powerpc/64s/radix: Fix soft dirty tracking
  powerpc: Move page table dump files in a dedicated subdirectory
  powerpc/mm: dump block address translation on book3s/32
  powerpc/mm: dump segment registers on book3s/32
  powerpc/mm: Move pgtable_t into platform headers
  powerpc/mm: move platform specific mmu-xxx.h in platform directories
  iio: addac: stx104: Fix race condition when converting analog-to-digital
  iio: addac: stx104: Fix race condition for stx104_write_raw()
  iio: adc: stx104: Implement and utilize register structures
  iio: adc: stx104: Utilize iomap interface
  iio: add addac subdirectory
  IMA: allow/fix UML builds
  drm/amdgpu: Fix potential fence use-after-free v2
  Bluetooth: L2CAP: Fix use-after-free
  pcmcia: rsrc_nonstatic: Fix memory leak in nonstatic_release_resource_db()
  gfs2: Fix possible data races in gfs2_show_options()
  media: platform: mediatek: vpu: fix NULL ptr dereference
  media: v4l2-mem2mem: add lock to protect parameter num_rdy
  FS: JFS: Check for read-only mounted filesystem in txBegin
  FS: JFS: Fix null-ptr-deref Read in txBegin
  MIPS: dec: prom: Address -Warray-bounds warning
  fs: jfs: Fix UBSAN: array-index-out-of-bounds in dbAllocDmapLev
  udf: Fix uninitialized array access for some pathnames
  HID: add quirk for 03f0:464a HP Elite Presenter Mouse
  quota: fix warning in dqgrab()
  quota: Properly disable quotas when add_dquot_ref() fails
  ALSA: emu10k1: roll up loops in DSP setup code for Audigy
  drm/radeon: Fix integer overflow in radeon_cs_parser_init
  selftests: forwarding: tc_flower: Relax success criterion
  lib/mpi: Eliminate unused umul_ppmm definitions for MIPS
  Revert "posix-timers: Ensure timer ID search-loop limit is valid"
  UPSTREAM: media: usb: siano: Fix warning due to null work_func_t function pointer
  UPSTREAM: Bluetooth: L2CAP: Fix use-after-free in l2cap_sock_ready_cb
  UPSTREAM: net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free
  UPSTREAM: net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free
  Linux 4.19.292
  sch_netem: fix issues in netem_change() vs get_dist_table()
  alpha: remove __init annotation from exported page_is_ram()
  scsi: core: Fix possible memory leak if device_add() fails
  scsi: snic: Fix possible memory leak if device_add() fails
  scsi: 53c700: Check that command slot is not NULL
  scsi: storvsc: Fix handling of virtual Fibre Channel timeouts
  scsi: core: Fix legacy /proc parsing buffer overflow
  netfilter: nf_tables: report use refcount overflow
  netfilter: nf_tables: bogus EBUSY when deleting flowtable after flush
  btrfs: don't stop integrity writeback too early
  ibmvnic: Handle DMA unmapping of login buffs in release functions
  wifi: cfg80211: fix sband iftype data lookup for AP_VLAN
  IB/hfi1: Fix possible panic during hotplug remove
  drivers: net: prevent tun_build_skb() to exceed the packet size limit
  dccp: fix data-race around dp->dccps_mss_cache
  bonding: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves
  net/packet: annotate data-races around tp->status
  mISDN: Update parameter type of dsp_cmx_send()
  drm/nouveau/disp: Revert a NULL check inside nouveau_connector_get_modes
  x86: Move gds_ucode_mitigated() declaration to header
  x86/mm: Fix VDSO and VVAR placement on 5-level paging machines
  x86/cpu/amd: Enable Zenbleed fix for AMD Custom APU 0405
  usb: dwc3: Properly handle processing of pending events
  usb-storage: alauda: Fix uninit-value in alauda_check_media()
  binder: fix memory leak in binder_init()
  iio: cros_ec: Fix the allocation size for cros_ec_command
  nilfs2: fix use-after-free of nilfs_root in dirtying inodes via iput
  radix tree test suite: fix incorrect allocation size for pthreads
  drm/nouveau/gr: enable memory loads on helper invocation on all channels
  dmaengine: pl330: Return DMA_PAUSED when transaction is paused
  ipv6: adjust ndisc_is_useropt() to also return true for PIO
  mmc: moxart: read scr register without changing byte order
  sparc: fix up arch_cpu_finalize_init() build breakage.
  UPSTREAM: net/sched: cls_fw: Fix improper refcount update leads to use-after-free
  Linux 4.19.291
  drm/edid: fix objtool warning in drm_cvt_modes()
  arm64: dts: stratix10: fix incorrect I2C property for SCL signal
  drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions
  ARM: dts: nxp/imx6sll: fix wrong property name in usbphy node
  ARM: dts: imx6sll: fixup of operating points
  ARM: dts: imx: add usb alias
  ARM: dts: imx6sll: Make ssi node name same as other platforms
  PM: sleep: wakeirq: fix wake irq arming
  PM / wakeirq: support enabling wake-up irq after runtime_suspend called
  powerpc/mm/altmap: Fix altmap boundary check
  mtd: rawnand: omap_elm: Fix incorrect type in assignment
  test_firmware: return ENOMEM instead of ENOSPC on failed memory allocation
  test_firmware: fix a memory leak with reqs buffer
  ext2: Drop fragment support
  net: usbnet: Fix WARNING in usbnet_start_xmit/usb_submit_urb
  Bluetooth: L2CAP: Fix use-after-free in l2cap_sock_ready_cb
  fs/sysv: Null check to prevent null-ptr-deref bug
  USB: zaurus: Add ID for A-300/B-500/C-700
  libceph: fix potential hang in ceph_osdc_notify()
  scsi: zfcp: Defer fc_rport blocking until after ADISC response
  tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
  tcp_metrics: annotate data-races around tm->tcpm_net
  tcp_metrics: annotate data-races around tm->tcpm_vals[]
  tcp_metrics: annotate data-races around tm->tcpm_lock
  tcp_metrics: annotate data-races around tm->tcpm_stamp
  tcp_metrics: fix addr_same() helper
  ip6mr: Fix skb_under_panic in ip6mr_cache_report()
  net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free
  net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free
  net: add missing data-race annotation for sk_ll_usec
  net: add missing data-race annotations around sk->sk_peek_off
  net: sched: cls_u32: Fix match key mis-addressing
  perf test uprobe_from_different_cu: Skip if there is no gcc
  net/mlx5e: fix return value check in mlx5e_ipsec_remove_trailer()
  KVM: s390: fix sthyi error handling
  word-at-a-time: use the same return type for has_zero regardless of endianness
  loop: Select I/O scheduler 'none' from inside add_disk()
  perf: Fix function pointer case
  net/sched: cls_u32: Fix reference counter leak leading to overflow
  ASoC: cs42l51: fix driver to properly autoload with automatic module loading
  net/sched: sch_qfq: account for stab overhead in qfq_enqueue
  net/sched: cls_fw: Fix improper refcount update leads to use-after-free
  drm/client: Fix memory leak in drm_client_target_cloned
  dm cache policy smq: ensure IO doesn't prevent cleaner policy progress
  ASoC: wm8904: Fill the cache for WM8904_ADC_TEST_0 register
  s390/dasd: fix hanging device after quiesce/resume
  virtio-net: fix race between set queues and probe
  serial: 8250_dw: Preserve original value of DLF register
  serial: 8250_dw: split Synopsys DesignWare 8250 common functions
  irq-bcm6345-l1: Do not assume a fixed block to cpu mapping
  tpm_tis: Explicitly check for error code
  btrfs: check for commit error at btrfs_attach_transaction_barrier()
  hwmon: (nct7802) Fix for temp6 (PECI1) processed even if PECI1 disabled
  staging: ks7010: potential buffer overflow in ks_wlan_set_encode_ext()
  Documentation: security-bugs.rst: clarify CVE handling
  Documentation: security-bugs.rst: update preferences when dealing with the linux-distros group
  usb: xhci-mtk: set the dma max_seg_size
  USB: quirks: add quirk for Focusrite Scarlett
  usb: ohci-at91: Fix the unhandle interrupt when resume
  usb: dwc3: don't reset device side if dwc3 was configured as host-only
  usb: dwc3: pci: skip BYT GPIO lookup table for hardwired phy
  Revert "usb: dwc3: core: Enable AutoRetry feature in the controller"
  can: gs_usb: gs_can_close(): add missing set of CAN state to CAN_STATE_STOPPED
  USB: serial: simple: sort driver entries
  USB: serial: simple: add Kaufmann RKS+CAN VCP
  USB: serial: option: add Quectel EC200A module support
  USB: serial: option: support Quectel EM060K_128
  tracing: Fix warning in trace_buffered_event_disable()
  ring-buffer: Fix wrong stat of cpu_buffer->read
  ata: pata_ns87415: mark ns87560_tf_read static
  dm raid: fix missing reconfig_mutex unlock in raid_ctr() error paths
  block: Fix a source code comment in include/uapi/linux/blkzoned.h
  ASoC: fsl_spdif: Silence output on stop
  drm/msm: Fix IS_ERR_OR_NULL() vs NULL check in a5xx_submit_in_rb()
  RDMA/mlx4: Make check for invalid flags stricter
  benet: fix return value check in be_lancer_xmit_workarounds()
  net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64
  net/sched: mqprio: add extack to mqprio_parse_nlattr()
  net/sched: mqprio: refactor nlattr parsing to a separate function
  platform/x86: msi-laptop: Fix rfkill out-of-sync on MSI Wind U100
  team: reset team's flags when down link is P2P device
  bonding: reset bond's flags when down link is P2P device
  tcp: Reduce chance of collisions in inet6_hashfn().
  ipv6 addrconf: fix bug where deleting a mngtmpaddr can create a new temporary address
  ethernet: atheros: fix return value check in atl1e_tso_csum()
  phy: hisilicon: Fix an out of bounds check in hisi_inno_phy_probe()
  i40e: Fix an NULL vs IS_ERR() bug for debugfs_create_dir()
  ext4: fix to check return value of freeze_bdev() in ext4_shutdown()
  scsi: qla2xxx: Array index may go out of bound
  scsi: qla2xxx: Fix inconsistent format argument type in qla_os.c
  ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()
  ftrace: Store the order of pages allocated in ftrace_page
  ftrace: Check if pages were allocated before calling free_pages()
  ftrace: Add information on number of page groups allocated
  fs: dlm: interrupt posix locks only when process is killed
  dlm: rearrange async condition return
  dlm: cleanup plock_op vs plock_xop
  PCI/ASPM: Avoid link retraining race
  PCI/ASPM: Factor out pcie_wait_for_retrain()
  PCI/ASPM: Return 0 or -ETIMEDOUT from pcie_retrain_link()
  PCI: Rework pcie_retrain_link() wait loop
  ext4: Fix reusing stale buffer heads from last failed mounting
  ext4: rename journal_dev to s_journal_dev inside ext4_sb_info
  btrfs: fix extent buffer leak after tree mod log failure at split_node()
  bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent
  bcache: remove 'int n' from parameter list of bch_bucket_alloc_set()
  bcache: use MAX_CACHES_PER_SET instead of magic number 8 in __bch_bucket_alloc_set
  gpio: tps68470: Make tps68470_gpio_output() always set the initial value
  tracing/histograms: Return an error if we fail to add histogram to hist_vars list
  tcp: annotate data-races around fastopenq.max_qlen
  tcp: annotate data-races around tp->notsent_lowat
  tcp: annotate data-races around rskq_defer_accept
  tcp: annotate data-races around tp->linger2
  net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX
  netfilter: nf_tables: can't schedule in nft_chain_validate
  netfilter: nf_tables: fix spurious set element insertion failure
  llc: Don't drop packet from non-root netns.
  fbdev: au1200fb: Fix missing IRQ check in au1200fb_drv_probe
  Revert "tcp: avoid the lookup process failing to get sk in ehash table"
  net:ipv6: check return value of pskb_trim()
  net: ethernet: ti: cpsw_ale: Fix cpsw_ale_get_field()/cpsw_ale_set_field()
  pinctrl: amd: Use amd_pinconf_set() for all config options
  fbdev: imxfb: warn about invalid left/right margin
  spi: bcm63xx: fix max prepend length
  igb: Fix igb_down hung on surprise removal
  wifi: iwlwifi: mvm: avoid baid size integer overflow
  wifi: wext-core: Fix -Wstringop-overflow warning in ioctl_standard_iw_point()
  bpf: Address KCSAN report on bpf_lru_list
  sched/fair: Don't balance task to its current running CPU
  posix-timers: Ensure timer ID search-loop limit is valid
  md/raid10: prevent soft lockup while flush writes
  md: fix data corruption for raid456 when reshape restart while grow up
  nbd: Add the maximum limit of allocated index in nbd_dev_add
  debugobjects: Recheck debug_objects_enabled before reporting
  ext4: correct inline offset when handling xattrs in inode body
  can: bcm: Fix UAF in bcm_proc_show()
  fuse: revalidate: don't invalidate if interrupted
  perf probe: Add test for regression introduced by switch to die_get_decl_file()
  tracing/histograms: Add histograms to hist_vars if they have referenced variables
  drm/atomic: Fix potential use-after-free in nonblocking commits
  scsi: qla2xxx: Pointer may be dereferenced
  scsi: qla2xxx: Check valid rport returned by fc_bsg_to_rport()
  scsi: qla2xxx: Fix potential NULL pointer dereference
  scsi: qla2xxx: Wait for io return on terminate rport
  xtensa: ISS: fix call to split_if_spec
  ring-buffer: Fix deadloop issue on reading trace_pipe
  tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() when iterating clk
  tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() in case of error
  Revert "8250: add support for ASIX devices with a FIFO bug"
  meson saradc: fix clock divider mask length
  ceph: don't let check_caps skip sending responses for revoke msgs
  hwrng: imx-rngc - fix the timeout for init and self check
  serial: atmel: don't enable IRQs prematurely
  fs: dlm: return positive pid value for F_GETLK
  md/raid0: add discard support for the 'original' layout
  misc: pci_endpoint_test: Re-init completion for every test
  misc: pci_endpoint_test: Free IRQs before removing the device
  PCI: rockchip: Use u32 variable to access 32-bit registers
  PCI: rockchip: Fix legacy IRQ generation for RK3399 PCIe endpoint core
  PCI: rockchip: Add poll and timeout to wait for PHY PLLs to be locked
  PCI: rockchip: Write PCI Device ID to correct register
  PCI: rockchip: Assert PCI Configuration Enable bit after probe
  PCI: qcom: Disable write access to read only registers for IP v2.3.3
  PCI: Add function 1 DMA alias quirk for Marvell 88SE9235
  PCI/PM: Avoid putting EloPOS E2/S2/H2 PCIe Ports in D3cold
  jfs: jfs_dmap: Validate db_l2nbperpage while mounting
  ext4: only update i_reserved_data_blocks on successful block allocation
  ext4: fix wrong unit use in ext4_mb_clear_bb
  perf intel-pt: Fix CYC timestamps after standalone CBR
  SUNRPC: Fix UAF in svc_tcp_listen_data_ready()
  net: bcmgenet: Ensure MDIO unregistration has clocks enabled
  tpm: tpm_vtpm_proxy: fix a race condition in /dev/vtpmx creation
  pinctrl: amd: Only use special debounce behavior for GPIO 0
  pinctrl: amd: Detect internal GPIO0 debounce handling
  pinctrl: amd: Fix mistake in handling clearing pins at startup
  net/sched: make psched_mtu() RTNL-less safe
  wifi: airo: avoid uninitialized warning in airo_get_rate()
  ipv6/addrconf: fix a potential refcount underflow for idev
  NTB: ntb_tool: Add check for devm_kcalloc
  NTB: ntb_transport: fix possible memory leak while device_register() fails
  ntb: intel: Fix error handling in intel_ntb_pci_driver_init()
  NTB: amd: Fix error handling in amd_ntb_pci_driver_init()
  ntb: idt: Fix error handling in idt_pci_driver_init()
  udp6: fix udp6_ehashfn() typo
  icmp6: Fix null-ptr-deref of ip6_null_entry->rt6i_idev in icmp6_dev().
  vrf: Increment Icmp6InMsgs on the original netdev
  net: mvneta: fix txq_map in case of txq_number==1
  workqueue: clean up WORK_* constant types, clarify masking
  net: lan743x: Don't sleep in atomic context
  netfilter: nf_tables: prevent OOB access in nft_byteorder_eval
  netfilter: conntrack: Avoid nf_ct_helper_hash uses after free
  netfilter: nf_tables: fix scheduling-while-atomic splat
  netfilter: nf_tables: unbind non-anonymous set if rule construction fails
  netfilter: nf_tables: reject unbound anonymous set before commit phase
  netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain
  netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE
  netfilter: nf_tables: use net_generic infra for transaction data
  netfilter: add helper function to set up the nfnetlink header and use it
  netfilter: nftables: add helper function to set the base sequence number
  netfilter: nf_tables: add rescheduling points during loop detection walks
  netfilter: nf_tables: fix nat hook table deletion
  spi: spi-fsl-spi: allow changing bits_per_word while CS is still active
  spi: spi-fsl-spi: relax message sanity checking a little
  spi: spi-fsl-spi: remove always-true conditional in fsl_spi_do_one_msg
  ARM: orion5x: fix d2net gpio initialization
  btrfs: fix race when deleting quota root from the dirty cow roots list
  jffs2: reduce stack usage in jffs2_build_xattr_subsystem()
  integrity: Fix possible multiple allocation in integrity_inode_get()
  bcache: Remove unnecessary NULL point check in node allocations
  mmc: core: disable TRIM on Micron MTFC4GACAJCN-1M
  mmc: core: disable TRIM on Kingston EMMC04G-M627
  NFSD: add encoding of op_recall flag for write delegation
  ALSA: jack: Fix mutex call in snd_jack_report()
  i2c: xiic: Don't try to handle more interrupt events after error
  i2c: xiic: Defer xiic_wakeup() and __xiic_start_xfer() in xiic_process()
  sh: dma: Fix DMA channel offset calculation
  net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX
  tcp: annotate data races in __tcp_oow_rate_limited()
  net: bridge: keep ports without IFF_UNICAST_FLT in BR_PROMISC mode
  powerpc: allow PPC_EARLY_DEBUG_CPM only when SERIAL_CPM=y
  f2fs: fix error path handling in truncate_dnode()
  mailbox: ti-msgmgr: Fill non-message tx data fields with 0x0
  spi: bcm-qspi: return error if neither hif_mspi nor mspi is available
  Add MODULE_FIRMWARE() for FIRMWARE_TG357766.
  sctp: fix potential deadlock on &net->sctp.addr_wq_lock
  rtc: st-lpc: Release some resources in st_rtc_probe() in case of error
  mfd: stmpe: Only disable the regulators if they are enabled
  mfd: intel-lpss: Add missing check for platform_get_resource
  KVM: s390: fix KVM_S390_GET_CMMA_BITS for GFNs in memslot holes
  mfd: rt5033: Drop rt5033-battery sub-device
  usb: phy: phy-tahvo: fix memory leak in tahvo_usb_probe()
  extcon: Fix kernel doc of property capability fields to avoid warnings
  extcon: Fix kernel doc of property fields to avoid warnings
  media: usb: siano: Fix warning due to null work_func_t function pointer
  media: videodev2.h: Fix struct v4l2_input tuner index comment
  media: usb: Check az6007_read() return value
  sh: j2: Use ioremap() to translate device tree address into kernel memory
  w1: fix loop in w1_fini()
  block: change all __u32 annotations to __be32 in affs_hardblocks.h
  USB: serial: option: add LARA-R6 01B PIDs
  ARC: define ASM_NL and __ALIGN(_STR) outside #ifdef __ASSEMBLY__ guard
  ARCv2: entry: rewrite to enable use of double load/stores LDD/STD
  ARCv2: entry: avoid a branch
  ARCv2: entry: push out the Z flag unclobber from common EXCEPTION_PROLOGUE
  ARCv2: entry: comments about hardware auto-save on taken interrupts
  modpost: fix section mismatch message for R_ARM_{PC24,CALL,JUMP24}
  modpost: fix section mismatch message for R_ARM_ABS32
  crypto: nx - fix build warnings when DEBUG_FS is not enabled
  hwrng: virtio - Fix race on data_avail and actual data
  hwrng: virtio - always add a pending request
  hwrng: virtio - don't waste entropy
  hwrng: virtio - don't wait on cleanup
  hwrng: virtio - add an internal buffer
  pinctrl: at91-pio4: check return value of devm_kasprintf()
  perf dwarf-aux: Fix off-by-one in die_get_varname()
  pinctrl: cherryview: Return correct value if pin in push-pull mode
  PCI: Add pci_clear_master() stub for non-CONFIG_PCI
  scsi: 3w-xxxx: Add error handling for initialization failure in tw_probe()
  ALSA: ac97: Fix possible NULL dereference in snd_ac97_mixer
  drm/radeon: fix possible division-by-zero errors
  fbdev: omapfb: lcd_mipid: Fix an error handling path in mipid_spi_probe()
  arm64: dts: renesas: ulcb-kf: Remove flow control for SCIF1
  IB/hfi1: Fix sdma.h tx->num_descs off-by-one errors
  soc/fsl/qe: fix usb.c build errors
  ASoC: es8316: Increment max value for ALC Capture Target Volume control
  ARM: ep93xx: fix missing-prototype warnings
  drm/panel: simple: fix active size for Ampire AM-480272H3TMQW-T01H
  Input: adxl34x - do not hardcode interrupt trigger type
  ARM: dts: BCM5301X: Drop "clock-names" from the SPI node
  Input: drv260x - sleep between polling GO bit
  radeon: avoid double free in ci_dpm_init()
  netlink: Add __sock_i_ino() for __netlink_diag_dump().
  ipvlan: Fix return value of ipvlan_queue_xmit()
  netfilter: nf_conntrack_sip: fix the ct_sip_parse_numerical_param() return value.
  lib/ts_bm: reset initial match offset for every block of text
  gtp: Fix use-after-free in __gtp_encap_destroy().
  netlink: do not hard code device address lenth in fdb dumps
  netlink: fix potential deadlock in netlink_set_err()
  wifi: ath9k: convert msecs to jiffies where needed
  wifi: ath9k: Fix possible stall on ath9k_txq_list_has_key()
  memstick r592: make memstick_debug_get_tpc_name() static
  kexec: fix a memory leak in crash_shrink_memory()
  watchdog/perf: more properly prevent false positives with turbo modes
  watchdog/perf: define dummy watchdog_update_hrtimer_threshold() on correct config
  wifi: rsi: Do not set MMC_PM_KEEP_POWER in shutdown
  wifi: ath9k: don't allow to overwrite ENDPOINT0 attributes
  wifi: ray_cs: Fix an error handling path in ray_probe()
  wifi: ray_cs: Drop useless status variable in parse_addr()
  wifi: ray_cs: Utilize strnlen() in parse_addr()
  wifi: wl3501_cs: Fix an error handling path in wl3501_probe()
  wl3501_cs: use eth_hw_addr_set()
  net: create netdev->dev_addr assignment helpers
  wl3501_cs: Fix misspelling and provide missing documentation
  wl3501_cs: Remove unnecessary NULL check
  wl3501_cs: Fix a bunch of formatting issues related to function docs
  wifi: atmel: Fix an error handling path in atmel_probe()
  wifi: orinoco: Fix an error handling path in orinoco_cs_probe()
  wifi: orinoco: Fix an error handling path in spectrum_cs_probe()
  nfc: llcp: fix possible use of uninitialized variable in nfc_llcp_send_connect()
  nfc: constify several pointers to u8, char and sk_buff
  wifi: mwifiex: Fix the size of a memory allocation in mwifiex_ret_802_11_scan()
  samples/bpf: Fix buffer overflow in tcp_basertt
  wifi: ath9k: avoid referencing uninit memory in ath9k_wmi_ctrl_rx
  wifi: ath9k: fix AR9003 mac hardware hang check register offset calculation
  evm: Complete description of evm_inode_setattr()
  ARM: 9303/1: kprobes: avoid missing-declaration warnings
  PM: domains: fix integer overflow issues in genpd_parse_state()
  clocksource/drivers/cadence-ttc: Fix memory leak in ttc_timer_probe
  clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
  clocksource/drivers: Unify the names to timer-* format
  irqchip/jcore-aic: Fix missing allocation of IRQ descriptors
  irqchip/jcore-aic: Kill use of irq_create_strict_mappings()
  md/raid10: fix io loss while replacement replace rdev
  md/raid10: fix wrong setting of max_corr_read_errors
  md/raid10: fix overflow of md/safe_mode_delay
  md/raid10: check slab-out-of-bounds in md_bitmap_get_counter
  treewide: Remove uninitialized_var() usage
  drm/amdgpu: Validate VM ioctl flags.
  scripts/tags.sh: Resolve gtags empty index generation
  drm/edid: Fix uninitialized variable in drm_cvt_modes()
  fbdev: imsttfb: Fix use after free bug in imsttfb_probe
  video: imsttfb: check for ioremap() failures
  x86/smp: Use dedicated cache-line for mwait_play_dead()
  gfs2: Don't deref jdesc in evict
  Linux 4.19.290
  x86: fix backwards merge of GDS/SRSO bit
  xen/netback: Fix buffer overrun triggered by unusual packet
  Documentation/x86: Fix backwards on/off logic about YMM support
  x86/xen: Fix secondary processors' FPU initialization
  KVM: Add GDS_NO support to KVM
  x86/speculation: Add Kconfig option for GDS
  x86/speculation: Add force option to GDS mitigation
  x86/speculation: Add Gather Data Sampling mitigation
  x86/fpu: Move FPU initialization into arch_cpu_finalize_init()
  x86/fpu: Mark init functions __init
  x86/fpu: Remove cpuinfo argument from init functions
  init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()
  init: Invoke arch_cpu_finalize_init() earlier
  init: Remove check_bugs() leftovers
  um/cpu: Switch to arch_cpu_finalize_init()
  sparc/cpu: Switch to arch_cpu_finalize_init()
  sh/cpu: Switch to arch_cpu_finalize_init()
  mips/cpu: Switch to arch_cpu_finalize_init()
  m68k/cpu: Switch to arch_cpu_finalize_init()
  ia64/cpu: Switch to arch_cpu_finalize_init()
  ARM: cpu: Switch to arch_cpu_finalize_init()
  x86/cpu: Switch to arch_cpu_finalize_init()
  init: Provide arch_cpu_finalize_init()

 Conflicts:
	drivers/mmc/core/block.c
	drivers/mmc/host/sdhci-msm.c
	drivers/usb/dwc3/core.c
	drivers/usb/dwc3/gadget.c

Change-Id: Id2f4d5c8067f8e5eda39c0eaa5e59d54a394b4c7
2023-09-19 18:11:03 +03:00
Kees Cook
b7e389235c treewide: Remove uninitialized_var() usage
commit 3f649ab728cda8038259d8f14492fe400fbab911 upstream.

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
	xargs perl -pi -e \
		's/\buninitialized_var\(([^\)]+)\)/\1/g;
		 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11 11:45:01 +02:00
Ivaylo Georgiev
5996b2fe7b Merge android-4.19.63 (75ff56e) into msm-4.19
* refs/heads/tmp-75ff56e:
  Linux 4.19.63
  access: avoid the RCU grace period for the temporary subjective credentials
  libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl()
  powerpc/tm: Fix oops on sigreturn on systems without TM
  powerpc/xive: Fix loop exit-condition in xive_find_target_in_mask()
  ALSA: hda - Add a conexant codec entry to let mute led work
  ALSA: line6: Fix wrong altsetting for LINE6_PODHD500_1
  ALSA: ac97: Fix double free of ac97_codec_device
  hpet: Fix division by zero in hpet_time_div()
  mei: me: add mule creek canyon (EHL) device ids
  fpga-manager: altera-ps-spi: Fix build error
  binder: prevent transactions to context manager from its own process.
  x86/speculation/mds: Apply more accurate check on hypervisor platform
  x86/sysfb_efi: Add quirks for some devices with swapped width and height
  btrfs: inode: Don't compress if NODATASUM or NODATACOW set
  usb: pci-quirks: Correct AMD PLL quirk detection
  usb: wusbcore: fix unbalanced get/put cluster_id
  locking/lockdep: Hide unused 'class' variable
  mm: use down_read_killable for locking mmap_sem in access_remote_vm
  locking/lockdep: Fix lock used or unused stats error
  proc: use down_read_killable mmap_sem for /proc/pid/maps
  cxgb4: reduce kernel stack usage in cudbg_collect_mem_region()
  proc: use down_read_killable mmap_sem for /proc/pid/map_files
  proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
  proc: use down_read_killable mmap_sem for /proc/pid/pagemap
  proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
  mm/mmu_notifier: use hlist_add_head_rcu()
  memcg, fsnotify: no oom-kill for remote memcg charging
  mm/gup.c: remove some BUG_ONs from get_gate_page()
  mm/gup.c: mark undo_dev_pagemap as __maybe_unused
  9p: pass the correct prototype to read_cache_page
  mm/kmemleak.c: fix check for softirq context
  sh: prevent warnings when using iounmap
  block/bio-integrity: fix a memory leak bug
  powerpc/eeh: Handle hugepages in ioremap space
  dlm: check if workqueues are NULL before flushing/destroying
  mailbox: handle failed named mailbox channel request
  f2fs: avoid out-of-range memory access
  block: init flush rq ref count to 1
  powerpc/boot: add {get, put}_unaligned_be32 to xz_config.h
  PCI: dwc: pci-dra7xx: Fix compilation when !CONFIG_GPIOLIB
  RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM
  perf hists browser: Fix potential NULL pointer dereference found by the smatch tool
  perf annotate: Fix dereferencing freed memory found by the smatch tool
  perf session: Fix potential NULL pointer dereference found by the smatch tool
  perf top: Fix potential NULL pointer dereference detected by the smatch tool
  perf stat: Fix use-after-freed pointer detected by the smatch tool
  perf test mmap-thread-lookup: Initialize variable to suppress memory sanitizer warning
  PCI: mobiveil: Use the 1st inbound window for MEM inbound transactions
  PCI: mobiveil: Initialize Primary/Secondary/Subordinate bus numbers
  kallsyms: exclude kasan local symbols on s390
  PCI: mobiveil: Fix the Class Code field
  PCI: mobiveil: Fix PCI base address in MEM/IO outbound windows
  arm64: assembler: Switch ESB-instruction with a vanilla nop if !ARM64_HAS_RAS
  IB/ipoib: Add child to parent list only if device initialized
  powerpc/mm: Handle page table allocation failures
  IB/mlx5: Fixed reporting counters on 2nd port for Dual port RoCE
  serial: sh-sci: Fix TX DMA buffer flushing and workqueue races
  serial: sh-sci: Terminate TX DMA during buffer flushing
  RDMA/i40iw: Set queue pair state when being queried
  powerpc/4xx/uic: clear pending interrupt after irq type/pol change
  um: Silence lockdep complaint about mmap_sem
  mm/swap: fix release_pages() when releasing devmap pages
  mfd: hi655x-pmic: Fix missing return value check for devm_regmap_init_mmio_clk
  mfd: arizona: Fix undefined behavior
  mfd: core: Set fwnode for created devices
  mfd: madera: Add missing of table registration
  recordmcount: Fix spurious mcount entries on powerpc
  powerpc/xmon: Fix disabling tracing while in xmon
  powerpc/cacheflush: fix variable set but not used
  iio: iio-utils: Fix possible incorrect mask calculation
  PCI: xilinx-nwl: Fix Multi MSI data programming
  genksyms: Teach parser about 128-bit built-in types
  kbuild: Add -Werror=unknown-warning-option to CLANG_FLAGS
  i2c: stm32f7: fix the get_irq error cases
  PCI: sysfs: Ignore lockdep for remove attribute
  serial: mctrl_gpio: Check if GPIO property exisits before requesting it
  drm/msm: Depopulate platform on probe failure
  powerpc/pci/of: Fix OF flags parsing for 64bit BARs
  mmc: sdhci: sdhci-pci-o2micro: Check if controller supports 8-bit width
  usb: gadget: Zero ffs_io_data
  tty: serial_core: Set port active bit in uart_port_activate
  serial: imx: fix locking in set_termios()
  drm/rockchip: Properly adjust to a true clock in adjusted_mode
  powerpc/pseries/mobility: prevent cpu hotplug during DT update
  drm/amd/display: fix compilation error
  phy: renesas: rcar-gen2: Fix memory leak at error paths
  drm/virtio: Add memory barriers for capset cache.
  drm/amd/display: Always allocate initial connector state state
  serial: 8250: Fix TX interrupt handling condition
  tty: serial: msm_serial: avoid system lockup condition
  tty/serial: digicolor: Fix digicolor-usart already registered warning
  memstick: Fix error cleanup path of memstick_init
  drm/crc-debugfs: Also sprinkle irqrestore over early exits
  drm/crc-debugfs: User irqsafe spinlock in drm_crtc_add_crc_entry
  gpu: host1x: Increase maximum DMA segment size
  drm/bridge: sii902x: pixel clock unit is 10kHz instead of 1kHz
  drm/bridge: tc358767: read display_props in get_modes()
  PCI: Return error if cannot probe VF
  drm/edid: Fix a missing-check bug in drm_load_edid_firmware()
  drm/amdkfd: Fix sdma queue map issue
  drm/amdkfd: Fix a potential memory leak
  drm/amd/display: Disable ABM before destroy ABM struct
  drm/amdgpu/sriov: Need to initialize the HDP_NONSURFACE_BAStE
  drm/amd/display: Fill prescale_params->scale for RGB565
  tty: serial: cpm_uart - fix init when SMC is relocated
  pinctrl: rockchip: fix leaked of_node references
  tty: max310x: Fix invalid baudrate divisors calculator
  usb: core: hub: Disable hub-initiated U1/U2
  staging: vt6656: use meaningful error code during buffer allocation
  iio: adc: stm32-dfsdm: missing error case during probe
  iio: adc: stm32-dfsdm: manage the get_irq error case
  drm/panel: simple: Fix panel_simple_dsi_probe
  hvsock: fix epollout hang from race condition

Change-Id: I4c37256db5ec08367a22a1c50bb97db267c822da
Signed-off-by: Ivaylo Georgiev <irgeorgiev@codeaurora.org>
2019-08-13 01:20:38 -07:00
Ira Weiny
30edc7c1fe mm/swap: fix release_pages() when releasing devmap pages
[ Upstream commit c5d6c45e90c49150670346967971e14576afd7f1 ]

release_pages() is an optimized version of a loop around put_page().
Unfortunately for devmap pages the logic is not entirely correct in
release_pages().  This is because device pages can be more than type
MEMORY_DEVICE_PUBLIC.  There are in fact 4 types, private, public, FS DAX,
and PCI P2PDMA.  Some of these have specific needs to "put" the page while
others do not.

This logic to handle any special needs is contained in
put_devmap_managed_page().  Therefore all devmap pages should be processed
by this function where we can contain the correct logic for a page put.

Handle all device type pages within release_pages() by calling
put_devmap_managed_page() on all devmap pages.  If
put_devmap_managed_page() returns true the page has been put and we
continue with the next page.  A false return of put_devmap_managed_page()
means the page did not require special processing and should fall to
"normal" processing.

This was found via code inspection while determining if release_pages()
and the new put_user_pages() could be interchangeable.[1]

[1] https://lkml.kernel.org/r/20190523172852.GA27175@iweiny-DESK2.sc.intel.com

Link: https://lkml.kernel.org/r/20190605214922.17684-1-ira.weiny@intel.com
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31 07:27:03 +02:00
qctecmdr
1e525b68fe Merge "Merge android-4.19.31 (bb418a1) into msm-4.19" 2019-05-01 10:57:19 -07:00
Laurent Dufour
808c47e1d5 mm: introduce __lru_cache_add_active_or_unevictable
The speculative page fault handler which is run without holding the
mmap_sem is calling lru_cache_add_active_or_unevictable() but the vm_flags
is not guaranteed to remain constant.
Introducing __lru_cache_add_active_or_unevictable() which has the vma flags
value parameter instead of the vma pointer.

Change-Id: I68decbe0f80847403127c45c97565e47512532e9
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:20
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
[charante@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
2019-03-29 03:09:25 -07:00
Michal Hocko
e6e9d6e290 mm: handle lru_add_drain_all for UP properly
[ Upstream commit 6ea183d60c469560e7b08a83c9804299e84ec9eb ]

Since for_each_cpu(cpu, mask) added by commit 2d3854a37e
("cpumask: introduce new API, without changing anything") did not
evaluate the mask argument if NR_CPUS == 1 due to CONFIG_SMP=n,
lru_add_drain_all() is hitting WARN_ON() at __flush_work() added by
commit 4d43d395fed12463 ("workqueue: Try to catch flush_work() without
INIT_WORK().") by unconditionally calling flush_work() [1].

Workaround this issue by using CONFIG_SMP=n specific lru_add_drain_all
implementation.  There is no real need to defer the implementation to
the workqueue as the draining is going to happen on the local cpu.  So
alias lru_add_drain_all to lru_add_drain which does all the necessary
work.

[akpm@linux-foundation.org: fix various build warnings]
[1] https://lkml.kernel.org/r/18a30387-6aa5-6123-e67c-57579ecc3f38@roeck-us.net
Link: http://lkml.kernel.org/r/20190213124334.GH4525@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Debugged-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:50 +01:00
Dan Williams
e763848843 mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
be able to rely on the fact that they will get wakeups on dev_pagemap
page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
generic_dax_page_free() as common indicator / infrastructure for dax
filesytems to require. With this change there are no users of the
MEMORY_DEVICE_HOST designation, so remove it.

The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1 it requires an additional branch to
check the page-type at put_page() time. Given put_page() is a hot-path
we do not want to incur that check if HMM is not in use, so a static
branch is used to avoid that overhead when not necessary.

Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.

For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
However, we still need to support FS_DAX in the FS_DAX_LIMITED case
implemented by the s390/dcssblk driver.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Reported-by: Dave Jiang <dave.jiang@intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22 06:59:39 -07:00
Mike Rapoport
002843de36 mm/swap.c: remove @cold parameter description for release_pages()
The 'cold' parameter was removed from release_pages function by commit
c6f92f9fbe ("mm: remove cold parameter for release_pages").

Update the description to match the code.

Link: http://lkml.kernel.org/r/1519585191-10180-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Mike Rapoport
cb6f0f3480 mm/swap.c: make functions and their kernel-doc agree (again)
There was a conflict between the commit e02a9f048e ("mm/swap.c: make
functions and their kernel-doc agree") and the commit f144c390f9 ("mm:
docs: fix parameter names mismatch") that both tried to fix mismatch
betweeen pagevec_lookup_entries() parameter names and their description.

Since nr_entries is a better name for the parameter, fix the description
again.

Link: http://lkml.kernel.org/r/1518116946-20947-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-21 15:35:43 -08:00
Shakeel Butt
9c4e6b1a70 mm, mlock, vmscan: no more skipping pagevecs
When a thread mlocks an address space backed either by file pages which
are currently not present in memory or swapped out anon pages (not in
swapcache), a new page is allocated and added to the local pagevec
(lru_add_pvec), I/O is triggered and the thread then sleeps on the page.
On I/O completion, the thread can wake on a different CPU, the mlock
syscall will then sets the PageMlocked() bit of the page but will not be
able to put that page in unevictable LRU as the page is on the pagevec
of a different CPU.  Even on drain, that page will go to evictable LRU
because the PageMlocked() bit is not checked on pagevec drain.

The page will eventually go to right LRU on reclaim but the LRU stats
will remain skewed for a long time.

This patch puts all the pages, even unevictable, to the pagevecs and on
the drain, the pages will be added on their LRUs correctly by checking
their evictability.  This resolves the mlocked pages on pagevec of other
CPUs issue because when those pagevecs will be drained, the mlocked file
pages will go to unevictable LRU.  Also this makes the race with munlock
easier to resolve because the pagevec drains happen in LRU lock.

However there is still one place which makes a page evictable and does
PageLRU check on that page without LRU lock and needs special attention.
TestClearPageMlocked() and isolate_lru_page() in clear_page_mlock().

	#0: __pagevec_lru_add_fn	#1: clear_page_mlock

	SetPageLRU()			if (!TestClearPageMlocked())
					  return
	smp_mb() // <--required
					// inside does PageLRU
	if (!PageMlocked())		if (isolate_lru_page())
	  move to evictable LRU		  putback_lru_page()
	else
	  move to unevictable LRU

In '#1', TestClearPageMlocked() provides full memory barrier semantics
and thus the PageLRU check (inside isolate_lru_page) can not be
reordered before it.

In '#0', without explicit memory barrier, the PageMlocked() check can be
reordered before SetPageLRU().  If that happens, '#0' can put a page in
unevictable LRU and '#1' might have just cleared the Mlocked bit of that
page but fails to isolate as PageLRU fails as '#0' still hasn't set
PageLRU bit of that page.  That page will be stranded on the unevictable
LRU.

There is one (good) side effect though.  Without this patch, the pages
allocated for System V shared memory segment are added to evictable LRUs
even after shmctl(SHM_LOCK) on that segment.  This patch will correctly
put such pages to unevictable LRU.

Link: http://lkml.kernel.org/r/20171121211241.18877-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-21 15:35:42 -08:00
Mike Rapoport
f144c390f9 mm: docs: fix parameter names mismatch
There are several places where parameter descriptions do no match the
actual code.  Fix it.

Link: http://lkml.kernel.org/r/1516700871-22279-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06 18:32:48 -08:00
Randy Dunlap
e02a9f048e mm/swap.c: make functions and their kernel-doc agree
Fix some basic kernel-doc notation in mm/swap.c:

 - for function lru_cache_add_anon(), make its kernel-doc function name
   match its function name and change colon to hyphen following the
   function name

 - for function pagevec_lookup_entries(), change the function parameter
   name from nr_pages to nr_entries since that is more descriptive of
   what the parameter actually is and then it matches the kernel-doc
   comments also

Fix function kernel-doc to match the change in commit 67fd707f46:

 - drop the kernel-doc notation for @nr_pages from
   pagevec_lookup_range() and correct the function description for that
   change

Link: http://lkml.kernel.org/r/3b42ee3e-04a9-a6ca-6be4-f00752a114fe@infradead.org
Fixes: 67fd707f46 ("mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Michal Hocko
9852a72123 mm: drop hotplug lock from lru_add_drain_all()
Pulling cpu hotplug locks inside the mm core function like
lru_add_drain_all just asks for problems and the recent lockdep splat
[1] just proves this.  While the usage in that particular case might be
wrong we should avoid the locking as lru_add_drain_all() is used in many
places.  It seems that this is not all that hard to achieve actually.

We have done the same thing for drain_all_pages which is analogous by
commit a459eeb7b8 ("mm, page_alloc: do not depend on cpu hotplug locks
inside the allocator").  All we have to care about is to handle

      - the work item might be executed on a different cpu in worker from
        unbound pool so it doesn't run on pinned on the cpu

      - we have to make sure that we do not race with page_alloc_cpu_dead
        calling lru_add_drain_cpu

the first part is already handled because the worker calls lru_add_drain
which disables preemption when calling lru_add_drain_cpu on the local
cpu it is draining.  The later is true because page_alloc_cpu_dead is
called on the controlling CPU after the hotplugged CPU vanished
completely.

[1] http://lkml.kernel.org/r/089e0825eec8955c1f055c83d476@google.com

[add a cpu hotplug locking interaction as per tglx]
Link: http://lkml.kernel.org/r/20171116120535.23765-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:36 -08:00
Mel Gorman
7f0b5fb953 mm, pagevec: rename pagevec drained field
According to Vlastimil Babka, the drained field in pagevec is
potentially misleading because it might be interpreted as draining this
pagevec instead of the percpu lru pagevecs.  Rename the field for
clarity.

Link: http://lkml.kernel.org/r/20171019093346.ylahzdpzmoriyf4v@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Mel Gorman
2d4894b5d2 mm: remove cold parameter from free_hot_cold_page*
Most callers users of free_hot_cold_page claim the pages being released
are cache hot.  The exception is the page reclaim paths where it is
likely that enough pages will be freed in the near future that the
per-cpu lists are going to be recycled and the cache hotness information
is lost.  As no one really cares about the hotness of pages being
released to the allocator, just ditch the parameter.

The APIs are renamed to indicate that it's no longer about hot/cold
pages.  It should also be less confusing as there are subtle differences
between them.  __free_pages drops a reference and frees a page when the
refcount reaches zero.  free_hot_cold_page handled pages whose refcount
was already zero which is non-obvious from the name.  free_unref_page
should be more obvious.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

[mgorman@techsingularity.net: add pages to head, not tail]
  Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Mel Gorman
c6f92f9fbe mm: remove cold parameter for release_pages
All callers of release_pages claim the pages being released are cache
hot.  As no one cares about the hotness of pages being released to the
allocator, just ditch the parameter.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Mel Gorman
8667982014 mm, pagevec: remove cold parameter for pagevecs
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot.  As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Mel Gorman
d9ed0d08b6 mm: only drain per-cpu pagevecs once per pagevec usage
When a pagevec is initialised on the stack, it is generally used
multiple times over a range of pages, looking up entries and then
releasing them.  On each pagevec_release, the per-cpu deferred LRU
pagevecs are drained on the grounds the page being released may be on
those queues and the pages may be cache hot.  In many cases only the
first drain is necessary as it's unlikely that the range of pages being
walked is racing against LRU addition.  Even if there is such a race,
the impact is marginal where as constantly redraining the lru pagevecs
costs.

This patch ensures that pagevec is only drained once in a given
lifecycle without increasing the cache footprint of the pagevec
structure.  Only sparsetruncate tiny is shown here as large files have
many exceptional entries and calls pagecache_release less frequently.

sparsetruncate (tiny)
                              4.14.0-rc4             4.14.0-rc4
                        batchshadow-v1r1          onedrain-v1r1
Min          Time      141.00 (   0.00%)      141.00 (   0.00%)
1st-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
Max-95%      Time      146.00 (   0.00%)      145.00 (   0.68%)
Max-99%      Time      198.00 (   0.00%)      194.00 (   2.02%)
Max          Time      254.00 (   0.00%)      208.00 (  18.11%)
Amean        Time      145.12 (   0.00%)      144.30 (   0.56%)
Stddev       Time       12.74 (   0.00%)        9.62 (  24.49%)
Coeff        Time        8.78 (   0.00%)        6.67 (  24.06%)
Best99%Amean Time      144.29 (   0.00%)      143.82 (   0.32%)
Best95%Amean Time      142.68 (   0.00%)      142.31 (   0.26%)
Best90%Amean Time      142.52 (   0.00%)      142.19 (   0.24%)
Best75%Amean Time      142.26 (   0.00%)      141.98 (   0.20%)
Best50%Amean Time      141.90 (   0.00%)      141.71 (   0.13%)
Best25%Amean Time      141.80 (   0.00%)      141.43 (   0.26%)

The impact on bonnie is marginal and within the noise because a
significant percentage of the file being truncated has been reclaimed
and consists of shadow entries which reduce the hotness of the
pagevec_release path.

Link: http://lkml.kernel.org/r/20171018075952.10627-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Jan Kara
67fd707f46 mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.  Just drop the argument.

Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara
93d3b7140a mm: add variant of pagevec_lookup_range_tag() taking number of pages
Currently pagevec_lookup_range_tag() takes number of pages to look up
but most users don't need this.  Create a new function
pagevec_lookup_range_nr_tag() that takes maximum number of pages to
lookup for Ceph which wants this functionality so that we can drop
nr_pages argument from pagevec_lookup_range_tag().

Link: http://lkml.kernel.org/r/20171009151359.31984-13-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara
72b045aecd mm: implement find_get_pages_range_tag()
Patch series "Ranged pagevec tagged lookup", v3.

In this series I provide a ranged variant of pagevec_lookup_tag() and
use it in places where it makes sense.  This series removes some common
code and it also has a potential for speeding up some operations
similarly as for pagevec_lookup_range() (but for now I can think of only
artificial cases where this happens).

This patch (of 16):

Implement a variant of find_get_pages_tag() that stops iterating at
given index.  Lots of users of this function (through pagevec_lookup())
actually want a range lookup and all of them are currently open-coding
this.

Also create corresponding pagevec_lookup_range_tag() function.

Link: http://lkml.kernel.org/r/20171009151359.31984-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: David Howells <dhowells@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Steve French <sfrench@samba.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Shaohua Li
24c92eb7dc mm: avoid marking swap cached page as lazyfree
MADV_FREE clears pte dirty bit and then marks the page lazyfree (clear
SwapBacked).  There is no lock to prevent the page is added to swap
cache between these two steps by page reclaim.  Page reclaim could add
the page to swap cache and unmap the page.  After page reclaim, the page
is added back to lru.  At that time, we probably start draining per-cpu
pagevec and mark the page lazyfree.  So the page could be in a state
with SwapBacked cleared and PG_swapcache set.  Next time there is a
refault in the virtual address, do_swap_page can find the page from swap
cache but the page has PageSwapCache false because SwapBacked isn't set,
so do_swap_page will bail out and do nothing.  The task will keep
running into fault handler.

Fixes: 802a3a92ad ("mm: reclaim MADV_FREE pages")
Link: http://lkml.kernel.org/r/6537ef3814398c0073630b03f176263bc81f0902.1506446061.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Reported-by: Artem Savkov <asavkov@redhat.com>
Tested-by: Artem Savkov <asavkov@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>	[4.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:24 -07:00
Jérôme Glisse
df6ad69838 mm/device-public-memory: device memory cache coherent with CPU
Platform with advance system bus (like CAPI or CCIX) allow device memory
to be accessible from CPU in a cache coherent fashion.  Add a new type of
ZONE_DEVICE to represent such memory.  The use case are the same as for
the un-addressable device memory but without all the corners cases.

Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:46 -07:00
Jan Kara
397162ffa2 mm: remove nr_pages argument from pagevec_lookup{,_range}()
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.

Just drop the argument.

Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Jan Kara
b947cee4b9 mm: implement find_get_pages_range()
Implement a variant of find_get_pages() that stops iterating at given
index.  This may be substantial performance gain if the mapping is
sparse.  See following commit for details.  Furthermore lots of users of
this function (through pagevec_lookup()) actually want a range lookup
and all of them are currently open-coding this.

Also create corresponding pagevec_lookup_range() function.

Link: http://lkml.kernel.org/r/20170726114704.7626-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Jan Kara
d72dc8a25a mm: make pagevec_lookup() update index
Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue.  Most callers want this
and also pagevec_lookup_tag() already does this.

Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Thomas Gleixner
a47fed5b5b mm: swap: provide lru_add_drain_all_cpuslocked()
The rework of the cpu hotplug locking unearthed potential deadlocks with
the memory hotplug locking code.

The solution for these is to rework the memory hotplug locking code as
well and take the cpu hotplug lock before the memory hotplug lock in
mem_hotplug_begin(), but this will cause a recursive locking of the cpu
hotplug lock when the memory hotplug code calls lru_add_drain_all().

Split out the inner workings of lru_add_drain_all() into
lru_add_drain_all_cpuslocked() so this function can be invoked from the
memory hotplug code with the cpu hotplug lock held.

Link: http://lkml.kernel.org/r/20170704093421.419329357@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:33 -07:00
Roman Gushchin
2262185c5b mm: per-cgroup memory reclaim stats
Track the following reclaim counters for every memory cgroup: PGREFILL,
PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

These values are exposed using the memory.stats interface of cgroup v2.

The meaning of each value is the same as for global counters, available
using /proc/vmstat.

Also, for consistency, rename mem_cgroup_count_vm_event() to
count_memcg_event_mm().

Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:35 -07:00
Shaohua Li
f7ad2a6cb9 mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
madv()'s MADV_FREE indicate pages are 'lazyfree'.  They are still
anonymous pages, but they can be freed without pageout.  To distinguish
these from normal anonymous pages, we clear their SwapBacked flag.

MADV_FREE pages could be freed without pageout, so they pretty much like
used once file pages.  For such pages, we'd like to reclaim them once
there is memory pressure.  Also it might be unfair reclaiming MADV_FREE
pages always before used once file pages and we definitively want to
reclaim the pages before other anonymous and file pages.

To speed up MADV_FREE pages reclaim, we put the pages into
LRU_INACTIVE_FILE list.  The rationale is LRU_INACTIVE_FILE list is tiny
nowadays and should be full of used once file pages.  Reclaiming
MADV_FREE pages will not have much interfere of anonymous and active
file pages.  And the inactive file pages and MADV_FREE pages will be
reclaimed according to their age, so we don't reclaim too many MADV_FREE
pages too.  Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also
means we can reclaim the pages without swap support.  This idea is
suggested by Johannes.

This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to
avoid bisect failure, next patch will do it.

The patch is based on Minchan's original patch.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:08 -07:00
Linus Torvalds
d3b5d35290 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "The main x86 MM changes in this cycle were:

   - continued native kernel PCID support preparation patches to the TLB
     flushing code (Andy Lutomirski)

   - various fixes related to 32-bit compat syscall returning address
     over 4Gb in applications, launched from 64-bit binaries - motivated
     by C/R frameworks such as Virtuozzo. (Dmitry Safonov)

   - continued Intel 5-level paging enablement: in particular the
     conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov)

   - x86/mpx ABI corner case fixes/enhancements (Joerg Roedel)

   - ... plus misc updates, fixes and cleanups"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
  mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
  x86/mm: Fix flush_tlb_page() on Xen
  x86/mm: Make flush_tlb_mm_range() more predictable
  x86/mm: Remove flush_tlb() and flush_tlb_current_task()
  x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
  x86/mm/64: Fix crash in remove_pagetable()
  Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
  x86/boot/e820: Remove a redundant self assignment
  x86/mm: Fix dump pagetables for 4 levels of page tables
  x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow
  x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
  Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()"
  x86/espfix: Add support for 5-level paging
  x86/kasan: Extend KASAN to support 5-level paging
  x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y
  x86/paravirt: Add 5-level support to the paravirt code
  x86/mm: Define virtual memory map for 5-level paging
  x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert
  x86/boot: Detect 5-level paging support
  x86/mm/numa: Remove numa_nodemask_from_meminfo()
  ...
2017-05-01 23:54:56 -07:00
Dan Williams
7138970383 mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash
The x86 conversion to the generic GUP code included a small change which causes
crashes and data corruption in the pmem code - not good.

The root cause is that the /dev/pmem driver code implicitly relies on the x86
get_user_pages() implementation doing a get_page() on the page refcount, because
get_page() does a get_zone_device_page() which properly refcounts pmem's separate
page struct arrays that are not present in the regular page struct structures.
(The pmem driver does this because it can cover huge memory areas.)

But the x86 conversion to the generic GUP code changed the get_page() to
page_cache_get_speculative() which is faster but doesn't do the
get_zone_device_page() call the pmem code relies on.

One way to solve the regression would be to change the generic GUP code to use
get_page(), but that would slow things down a bit and punish other generic-GUP
using architectures for an x86-ism they did not care about. (Arguably the pmem
driver was probably not working reliably for them: but nvdimm is an Intel
feature, so non-x86 exposure is probably still limited.)

So restructure the pmem code's interface with the MM instead: get rid of the
get/put_zone_device_page() distinction, integrate put_zone_device_page() into
__put_page() and and restructure the pmem completion-wait and teardown machinery:

Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.

This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.

This speeds up things while also making the pmem refcounting more robust going
forward.

Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-01 09:15:53 +02:00
Michal Hocko
ce612879dd mm: move pcp and lru-pcp draining into single wq
We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
per cpu lru caches.  This seems more than necessary because both can run
on a single WQ.  Both do not block on locks requiring a memory
allocation nor perform any allocations themselves.  We will save one
rescuer thread this way.

On the other hand drain_all_pages() queues work on the system wq which
doesn't have rescuer and so this depend on memory allocation (when all
workers are stuck allocating and new ones cannot be created).

Initially we thought this would be more of a theoretical problem but
Hugh Dickins has reported:

: 4.11-rc has been giving me hangs after hours of swapping load.  At
: first they looked like memory leaks ("fork: Cannot allocate memory");
: but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
: before looking at /proc/meminfo one time, and the stat_refresh stuck
: in D state, waiting for completion of flush_work like many kworkers.
: kthreadd waiting for completion of flush_work in drain_all_pages().

This worker should be using WQ_RECLAIM as well in order to guarantee a
forward progress.  We can reuse the same one as for lru draining and
vmstat.

Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Yang Li <pku.leo@gmail.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-08 00:47:49 -07:00
Johannes Weiner
c55e8d035b mm: vmscan: move dirty pages out of the way until they're flushed
We noticed a performance regression when moving hadoop workloads from
3.10 kernels to 4.0 and 4.6.  This is accompanied by increased pageout
activity initiated by kswapd as well as frequent bursts of allocation
stalls and direct reclaim scans.  Even lowering the dirty ratios to the
equivalent of less than 1% of memory would not eliminate the issue,
suggesting that dirty pages concentrate where the scanner is looking.

This can be traced back to recent efforts of thrash avoidance.  Where
3.10 would not detect refaulting pages and continuously supply clean
cache to the inactive list, a thrashing workload on 4.0+ will detect and
activate refaulting pages right away, distilling used-once pages on the
inactive list much more effectively.  This is by design, and it makes
sense for clean cache.  But for the most part our workload's cache
faults are refaults and its use-once cache is from streaming writes.  We
end up with most of the inactive list dirty, and we don't go after the
active cache as long as we have use-once pages around.

But waiting for writes to avoid reclaiming clean cache that *might*
refault is a bad trade-off.  Even if the refaults happen, reads are
faster than writes.  Before getting bogged down on writeback, reclaim
should first look at *all* cache in the system, even active cache.

To accomplish this, activate pages that are dirty or under writeback
when they reach the end of the inactive LRU.  The pages are marked for
immediate reclaim, meaning they'll get moved back to the inactive LRU
tail as soon as they're written back and become reclaimable.  But in the
meantime, by reducing the inactive list to only immediately reclaimable
pages, we allow the scanner to deactivate and refill the inactive list
with clean cache from the active list tail to guarantee forward
progress.

[hannes@cmpxchg.org: update comment]
  Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org
Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Huang, Ying
4b3ef9daa4 mm/swap: split swap cache into 64MB trunks
The patch is to improve the scalability of the swap out/in via using
fine grained locks for the swap cache.  In current kernel, one address
space will be used for each swap device.  And in the common
configuration, the number of the swap device is very small (one is
typical).  This causes the heavy lock contention on the radix tree of
the address space if multiple tasks swap out/in concurrently.

But in fact, there is no dependency between pages in the swap cache.  So
that, we can split the one shared address space for each swap device
into several address spaces to reduce the lock contention.  In the
patch, the shared address space is split into 64MB trunks.  64MB is
chosen to balance the memory space usage and effect of lock contention
reduction.

The size of struct address_space on x86_64 architecture is 408B, so with
the patch, 6528B more memory will be used for every 1GB swap space on
x86_64 architecture.

One address space is still shared for the swap entries in the same 64M
trunks.  To avoid lock contention for the first round of swap space
allocation, the order of the swap clusters in the initial free clusters
list is changed.  The swap space distance between the consecutive swap
clusters in the free cluster list is at least 64M.  After the first
round of allocation, the swap clusters are expected to be freed
randomly, so the lock contention should be reduced effectively.

Link: http://lkml.kernel.org/r/735bab895e64c930581ffb0a05b661e01da82bc5.1484082593.git.tim.c.chen@linux.intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net> escreveu:
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:30 -08:00
Nicholas Piggin
6290602709 mm: add PageWaiters indicating tasks are waiting for a page bit
Add a new page flag, PageWaiters, to indicate the page waitqueue has
tasks waiting. This can be tested rather than testing waitqueue_active
which requires another cacheline load.

This bit is always set when the page has tasks on page_waitqueue(page),
and is set and cleared under the waitqueue lock. It may be set when
there are no tasks on the waitqueue, which will cause a harmless extra
wakeup check that will clears the bit.

The generic bit-waitqueue infrastructure is no longer used for pages.
Instead, waitqueues are used directly with a custom key type. The
generic code was not flexible enough to have PageWaiters manipulation
under the waitqueue lock (which simplifies concurrency).

This improves the performance of page lock intensive microbenchmarks by
2-3%.

Putting two bits in the same word opens the opportunity to remove the
memory barrier between clearing the lock bit and testing the waiters
bit, after some work on the arch primitives (e.g., ensuring memory
operand widths match and cover both bits).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-25 11:54:48 -08:00
Aaron Lu
6fcb52a56f thp: reduce usage of huge zero page's atomic counter
The global zero page is used to satisfy an anonymous read fault.  If
THP(Transparent HugePage) is enabled then the global huge zero page is
used.  The global huge zero page uses an atomic counter for reference
counting and is allocated/freed dynamically according to its counter
value.

CPU time spent on that counter will greatly increase if there are a lot
of processes doing anonymous read faults.  This patch proposes a way to
reduce the access to the global counter so that the CPU load can be
reduced accordingly.

To do this, a new flag of the mm_struct is introduced:
MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
the global counter in two cases:

 1 The first time it uses the global huge zero page;
 2 The time when mm_user of its mm_struct reaches zero.

Note that right now, the huge zero page is eligible to be freed as soon
as its last use goes away.  With this patch, the page will not be
eligible to be freed until the exit of the last process from which it
was ever used.

And with the use of mm_user, the kthread is not eligible to use huge
zero page either.  Since no kthread is using huge zero page today, there
is no difference after applying this patch.  But if that is not desired,
I can change it to when mm_count reaches zero.

Case used for test on Haswell EP:

  usemem -n 72 --readonly -j 0x200000 100G

Which spawns 72 processes and each will mmap 100G anonymous space and
then do read only access to that space sequentially with a step of 2MB.

  CPU cycles from perf report for base commit:
      54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
  CPU cycles from perf report for this commit:
       0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page

Performance(throughput) of the workload for base commit: 1784430792
Performance(throughput) of the workload for this commit: 4726928591
164% increase.

Runtime of the workload for base commit: 707592 us
Runtime of the workload for this commit: 303970 us
50% drop.

Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Mel Gorman
68eb0731c4 mm, pagevec: release/reacquire lru_lock on pgdat change
With node-lru, the locking is based on the pgdat.  Previously it was
required that a pagevec drain released one zone lru_lock and acquired
another zone lru_lock on every zone change.  Now, it's only necessary if
the node changes.  The end-result is fewer lock release/acquires if the
pages are all on the same node but in different zones.

Link: http://lkml.kernel.org/r/1468588165-12461-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
599d0c954f mm, vmscan: move LRU lists to node
This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic.  Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes.  It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies.  For example, the scans are
per-zone but using per-node counters.  We also mark a node as congested
when a zone is congested.  This causes weird problems that are fixed
later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

   When pages are skipped, they are immediately added back onto the LRU
   list. If lowmem reclaim persisted for long periods of time, the same
   highmem pages get continually scanned. The idea would be that lowmem
   keeps those pages on a separate list until a reclaim for highmem pages
   arrives that splices the highmem pages back onto the LRU. It potentially
   could be implemented similar to the UNEVICTABLE list.

   That would reduce the skip rate with the potential corner case is that
   highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

   This will break LRU ordering but may be preferable and faster during
   memory pressure than skipping LRU pages.

Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
a52633d8e9 mm, vmscan: move lru_lock to the node
Node-based reclaim requires node-based LRUs and locking.  This is a
preparation patch that just moves the lru_lock to the node so later
patches are easier to review.  It is a mechanical change but note this
patch makes contention worse because the LRU lock is hotter and direct
reclaim and kswapd can contend on the same lock even when reclaiming
from different zones.

Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Kirill A. Shutemov
800d8c63b2 shmem: add huge pages support
Here's basic implementation of huge pages support for shmem/tmpfs.

It's all pretty streight-forward:

  - shmem_getpage() allcoates huge page if it can and try to inserd into
    radix tree with shmem_add_to_page_cache();

  - shmem_add_to_page_cache() puts the page onto radix-tree if there's
    space for it;

  - shmem_undo_range() removes huge pages, if it fully within range.
    Partial truncate of huge pages zero out this part of THP.

    This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create hole in this case,
    lseek(SEEK_HOLE) may have inconsistent results depending what
    pages happened to be allocated.

  - no need to change shmem_fault: core-mm will map an compound page as
    huge if VMA is suitable;

Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Lukasz Odzioba
8f182270df mm/swap.c: flush lru pvecs on compound page arrival
Currently we can have compound pages held on per cpu pagevecs, which
leads to a lot of memory unavailable for reclaim when needed.  In the
systems with hundreads of processors it can be GBs of memory.

On of the way of reproducing the problem is to not call munmap
explicitly on all mapped regions (i.e.  after receiving SIGTERM).  After
that some pages (with THP enabled also huge pages) may end up on
lru_add_pvec, example below.

  void main() {
  #pragma omp parallel
  {
	size_t size = 55 * 1000 * 1000; // smaller than  MEM/CPUS
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		MAP_PRIVATE | MAP_ANONYMOUS , -1, 0);
	if (p != MAP_FAILED)
		memset(p, 0, size);
	//munmap(p, size); // uncomment to make the problem go away
  }
  }

When we run it with THP enabled it will leave significant amount of
memory on lru_add_pvec.  This memory will be not reclaimed if we hit
OOM, so when we run above program in a loop:

	for i in `seq 100`; do ./a.out; done

many processes (95% in my case) will be killed by OOM.

The primary point of the LRU add cache is to save the zone lru_lock
contention with a hope that more pages will belong to the same zone and
so their addition can be batched.  The huge page is already a form of
batched addition (it will add 512 worth of memory in one go) so skipping
the batching seems like a safer option when compared to a potential
excess in the caching which can be quite large and much harder to fix
because lru_add_drain_all is way to expensive and it is not really clear
what would be a good moment to call it.

Similarly we can reproduce the problem on lru_deactivate_pvec by adding:
madvise(p, size, MADV_FREE); after memset.

This patch flushes lru pvecs on compound page arrival making the problem
less severe - after applying it kill rate of above example drops to 0%,
due to reducing maximum amount of memory held on pvec from 28MB (with
THP) to 56kB per CPU.

Suggested-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.com
Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Ming Li <mingli199x@qq.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Wang Sheng-Hui
f3a932baa7 mm: introduce dedicated WQ_MEM_RECLAIM workqueue to do lru_add_drain_all
This patch is based on https://patchwork.ozlabs.org/patch/574623/.

Tejun submitted commit 23d11a58a9 ("workqueue: skip flush dependency
checks for legacy workqueues") for the legacy create*_workqueue()
interface.

But some workq created by alloc_workqueue still reports warning on
memory reclaim, e.g nvme_workq with flag WQ_MEM_RECLAIM set:

    workqueue: WQ_MEM_RECLAIM nvme:nvme_reset_work is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 6 at SoC/linux/kernel/workqueue.c:2448 check_flush_dependency+0xb4/0x10c
    ...
    check_flush_dependency+0xb4/0x10c
    flush_work+0x54/0x140
    lru_add_drain_all+0x138/0x188
    migrate_prep+0xc/0x18
    alloc_contig_range+0xf4/0x350
    cma_alloc+0xec/0x1e4
    dma_alloc_from_contiguous+0x38/0x40
    __dma_alloc+0x74/0x25c
    nvme_alloc_queue+0xcc/0x36c
    nvme_reset_work+0x5c4/0xda8
    process_one_work+0x128/0x2ec
    worker_thread+0x58/0x434
    kthread+0xd4/0xe8
    ret_from_fork+0x10/0x50

That's because lru_add_drain_all() will schedule the drain work on
system_wq, whose flag is set to 0, !WQ_MEM_RECLAIM.

Introduce a dedicated WQ_MEM_RECLAIM workqueue to do
lru_add_drain_all(), aiding in getting memory freed.

Link: http://lkml.kernel.org/r/1464917521-9775-1-git-send-email-shhuiw@foxmail.com
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-09 14:23:11 -07:00
Ming Li
a4a921aa5c mm/swap.c: put activate_page_pvecs and other pagevecs together
Put the activate_page_pvecs definition next to those of the other
pagevecs, for clarity.

Signed-off-by: Ming Li <mingli199x@qq.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Kirill A. Shutemov
aa88b68c3b thp: keep huge zero page pinned until tlb flush
Andrea has found[1] a race condition on MMU-gather based TLB flush vs
split_huge_page() or shrinker which frees huge zero under us (patch 1/2
and 2/2 respectively).

With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the
page pinned until flush is complete and the pin prevents the page from
being split under us.

We still need patch 2/2.  This is simplified version of Andrea's patch.
We don't need fancy encoding.

[1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.com

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Kirill A. Shutemov
ea1754a084 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Dan Williams
3565fce3a6 mm, x86: get_user_pages() for dax mappings
A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
has established a devm_memremap_pages() mapping, i.e.  when the pfn_t
return from ->direct_access() has PFN_DEV and PFN_MAP set.  Later, when
encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
struct dev_pagemap instance to keep the result of pfn_to_page() valid
until put_page().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Minchan Kim
10853a0392 mm: move lazily freed pages to inactive list
MADV_FREE is a hint that it's okay to discard pages if there is memory
pressure and we use reclaimers(ie, kswapd and direct reclaim) to free
them so there is no value keeping them in the active anonymous LRU so
this patch moves them to inactive LRU list's head.

This means that MADV_FREE-ed pages which were living on the inactive
list are reclaimed first because they are more likely to be cold rather
than recently active pages.

An arguable issue for the approach would be whether we should put the
page to the head or tail of the inactive list.  I chose head because the
kernel cannot make sure it's really cold or warm for every MADV_FREE
usecase but at least we know it's not *hot*, so landing of inactive head
would be a comprimise for various usecases.

This fixes suboptimal behavior of MADV_FREE when pages living on the
active list will sit there for a long time even under memory pressure
while the inactive list is reclaimed heavily.  This basically breaks the
whole purpose of using MADV_FREE to help the system to free memory which
is might not be used.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: <yalin.wang2010@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Evans <je@fb.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
e90309c9f7 thp: allow mlocked THP again
Before THP refcounting rework, THP was not allowed to cross VMA
boundary.  So, if we have THP and we split it, PG_mlocked can be safely
transferred to small pages.

With new THP refcounting and naive approach to mlocking we can end up
with this scenario:
 1. we have a mlocked THP, which belong to one VM_LOCKED VMA.
 2. the process does munlock() on the *part* of the THP:
      - the VMA is split into two, one of them VM_LOCKED;
      - huge PMD split into PTE table;
      - THP is still mlocked;
 3. split_huge_page():
      - it transfers PG_mlocked to *all* small pages regrardless if it
	blong to any VM_LOCKED VMA.

We probably could munlock() all small pages on split_huge_page(), but I
think we have accounting issue already on step two.

Instead of forbidding mlocked pages altogether, we just avoid mlocking
PTE-mapped THPs and munlock THPs on split_huge_pmd().

This means PTE-mapped THPs will be on normal lru lists and will be split
under memory pressure by vmscan.  After the split vmscan will detect
unevictable small pages and mlock them.

With this approach we shouldn't hit situation like described above.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00