kernel_xiaomi_sm8250

Author	SHA1	Message	Date
Ivaylo Georgiev	4c30d46517	Merge android-4.19.95 (`5da1114`) into msm-4.19 * refs/heads/tmp-5da1114: Revert crypto changes from android-4.19.79-95 Revert "UPSTREAM: PM / wakeup updates" Revert "ANDROID: of: property: Enable of_devlink by default" Revert "UPSTREAM: dt-bindings: arm: coresight: Add support for coresight-loses-context-with-cpu" UPSTREAM: net: usbnet: Fix -Wcast-function-type UPSTREAM: USB: dummy-hcd: use usb_urb_dir_in instead of usb_pipein UPSTREAM: USB: dummy-hcd: increase max number of devices to 32 ANDROID: tty: serdev: Fix broken serial console input ANDROID: update kernel ABI (perf_event changes) BACKPORT: perf_event: Add support for LSM and SELinux checks UPSTREAM: iommu: Allow io-pgtable to be used outside of drivers/iommu/ ANDROID: update abi for 4.19.94 release ANDROID: update abi due to revert Revert "BACKPORT: perf_event: Add support for LSM and SELinux checks" UPSTREAM: selinux: sidtab reverse lookup hash table UPSTREAM: selinux: avoid atomic_t usage in sidtab UPSTREAM: selinux: check sidtab limit before adding a new entry UPSTREAM: selinux: fix context string corruption in convert_context() UPSTREAM: selinux: overhaul sidtab to fix bug and improve performance UPSTREAM: selinux: refactor mls_context_to_sid() and make it stricter UPSTREAM: selinux: use separate table for initial SID lookup UPSTREAM: selinux: make "selinux_policycap_names[]" const char * UPSTREAM: selinux: refactor sidtab conversion ANDROID: Update ABI representation ANDROID: GKI: clk: Don't disable unused clocks with sync state support ANDROID: GKI: clk: Add support for clock providers with sync state ANDROID: GKI: driver core: Add dev_has_sync_state() ANDROID: update kernel ABI representation BACKPORT: perf_event: Add support for LSM and SELinux checks ANDROID: update ABI representation UPSTREAM: exit: panic before exit_mm() on global init exit ANDROID: serdev: Fix platform device support ANDROID: Kconfig.gki: Add Hidden SPRD DRM configs ANDROID: gki_defconfig: Disable TRANSPARENT_HUGEPAGE ANDROID: gki_defconfig: Enable CONFIG_GNSS_CMDLINE_SERIAL ANDROID: gnss: Add command line test driver ANDROID: serdev: add platform device support ANDROID: gki_defconfig: enable ARM64_SW_TTBR0_PAN ANDROID: gki_defconfig: Set BINFMT_MISC as =m UPSTREAM: binder: fix incorrect calculation for num_valid ABI: Update ABI after f2fs merge ANDROID: add initial ABI whitelist for android-4.19 ANDROID: staging: android: ion: Fix build when CONFIG_ION_SYSTEM_HEAP=n ANDROID: staging: android: ion: Expose total heap and pool sizes via sysfs ANDROID: Update ABI representation due to vmstat counter changes UPSTREAM: include/linux/slab.h: fix sparse warning in kmalloc_type() UPSTREAM: mm, slab: shorten kmalloc cache names for large sizes UPSTREAM: mm, proc: add KReclaimable to /proc/meminfo UPSTREAM: mm: rename and change semantics of nr_indirectly_reclaimable_bytes UPSTREAM: dcache: allocate external names from reclaimable kmalloc caches UPSTREAM: mm, slab/slub: introduce kmalloc-reclaimable caches UPSTREAM: mm, slab: combine kmalloc_caches and kmalloc_dma_caches ANDROID: abi update for 4.19.89 ANDROID: update abi_gki_aarch64.xml for LTO, CFI, and SCS ANDROID: gki_defconfig: enable LTO, CFI, and SCS ANDROID: update abi_gki_aarch64.xml for CONFIG_GNSS ANDROID: cuttlefish_defconfig: Enable CONFIG_GNSS UPSTREAM: arm64: Validate tagged addresses in access_ok() called from kernel threads ANDROID: mm: Throttle rss_stat tracepoint UPSTREAM: mm: slub: really fix slab walking for init_on_free ANDROID: update abi_gki_aarch64.xml for nf change ANDROID: kbuild: limit LTO inlining ANDROID: kbuild: merge module sections with LTO ANDROID: netfilter: nf_nat: remove static from nf_nat_ipv4_fn UPSTREAM: drm/client: remove the exporting of drm_client_close ANDROID: f2fs: fix possible merge of unencrypted with encrypted I/O UPSTREAM: binder: Add binder_proc logging to binderfs UPSTREAM: binder: Make transaction_log available in binderfs UPSTREAM: binder: Add stats, state and transactions files UPSTREAM: binder: add a mount option to show global stats UPSTREAM: binder: Validate the default binderfs device names. UPSTREAM: binder: Add default binder devices through binderfs when configured UPSTREAM: binder: fix CONFIG_ANDROID_BINDER_DEVICES UPSTREAM: android: binder: use kstrdup instead of open-coding it UPSTREAM: binderfs: remove separate device_initcall() UPSTREAM: binderfs: respect limit on binder control creation UPSTREAM: binderfs: switch from d_add() to d_instantiate() UPSTREAM: binderfs: drop lock in binderfs_binder_ctl_create UPSTREAM: binderfs: kill_litter_super() before cleanup UPSTREAM: binderfs: rework binderfs_binder_device_create() UPSTREAM: binderfs: rework binderfs_fill_super() UPSTREAM: binderfs: prevent renaming the control dentry UPSTREAM: binderfs: remove outdated comment UPSTREAM: binderfs: fix error return code in binderfs_fill_super() UPSTREAM: binderfs: handle !CONFIG_IPC_NS builds UPSTREAM: binderfs: reserve devices for initial mount UPSTREAM: binderfs: rename header to binderfs.h UPSTREAM: binderfs: implement "max" mount option UPSTREAM: binderfs: make each binderfs mount a new instance UPSTREAM: binderfs: remove wrong kern_mount() call UPSTREAM: binder: implement binderfs UPSTREAM: binder: remove BINDER_DEBUG_ENTRY() ANDROID: Don't base allmodconfig on gki_defconfig ANDROID: Disable UNWINDER_ORC for allmodconfig ANDROID: update abi_gki_aarch64.xml for 4.19.87 BACKPORT: ARM: 8905/1: Emit __gnu_mcount_nc when using Clang 10.0.0 or newer ANDROID: update abi_gki_aarch64.xml ANDROID: gki_defconfig: =m's applied for virtio configs in arm64 UPSTREAM: of: property: Add device link support for interrupt-parent, dmas and -gpio(s) UPSTREAM: of: property: Add device link support for "iommu-map" UPSTREAM: of: property: Fix the semantics of of_is_ancestor_of() UPSTREAM: i2c: of: Populate fwnode in of_i2c_get_board_info() UPSTREAM: driver core: Clarify documentation for fwnode_operations.add_links() UPSTREAM: dt-bindings: arm: coresight: Add support for coresight-loses-context-with-cpu BACKPORT: coresight: etm4x: Save/restore state across CPU low power states ANDROID: Update ABI representation ANDROID: gki_defconfig: IIO=y f2fs: stop GC when the victim becomes fully valid f2fs: expose main_blkaddr in sysfs f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project() f2fs: Fix deadlock in f2fs_gc() context during atomic files handling f2fs: show f2fs instance in printk_ratelimited f2fs: fix potential overflow f2fs: fix to update dir's i_pino during cross_rename f2fs: support aligned pinned file f2fs: avoid kernel panic on corruption test f2fs: fix wrong description in document f2fs: cache global IPU bio f2fs: fix to avoid memory leakage in f2fs_listxattr f2fs: check total_segments from devices in raw_super f2fs: update multi-dev metadata in resize_fs f2fs: mark recovery flag correctly in read_raw_super_block() f2fs: fix to update time in lazytime mode vfs: don't allow writes to swap files mm: set S_SWAPFILE on blockdev swap devices BACKPORT: ARM: 8900/1: UNWINDER_FRAME_POINTER implementation for Clang ANDROID: update abi_gki_aarch64.xml for 4.19.87 ANDROID: gki_defconfig: FW_CACHE to no FROMGIT: firmware_class: make firmware caching configurable FROMLIST: arm64: implement Shadow Call Stack FROMLIST: arm64: disable SCS for hypervisor code BACKPORT: FROMLIST: arm64: vdso: disable Shadow Call Stack FROMLIST: arm64: efi: restore x18 if it was corrupted FROMLIST: arm64: preserve x18 when CPU is suspended FROMLIST: arm64: reserve x18 from general allocation with SCS FROMLIST: arm64: disable function graph tracing with SCS FROMLIST: scs: add support for stack usage debugging FROMLIST: scs: add accounting FROMLIST: add support for Clang's Shadow Call Stack (SCS) FROMLIST: arm64: kernel: avoid x18 in __cpu_soft_restart FROMLIST: arm64: kvm: stop treating register x18 as caller save FROMLIST: arm64/lib: copy_page: avoid x18 register in assembler code FROMLIST: arm64: mm: avoid x18 in idmap_kpti_install_ng_mappings ANDROID: use non-canonical CFI jump tables ANDROID: arm64: add __nocfi to __apply_alternatives ANDROID: arm64: add __pa_function ANDROID: arm64: allow ThinLTO to be selected ANDROID: soc/tegra: disable ARCH_TEGRA_210_SOC with LTO FROMLIST: arm64: fix alternatives with LLVM's integrated assembler ANDROID: irqchip/gic-v3: rename gic_of_init to work around a ThinLTO+CFI bug ANDROID: init: ensure initcall ordering with LTO Revert "ANDROID: init: ensure initcall ordering with LTO" ANDROID: add support for ThinLTO ANDROID: clang: update to 10.0.1 ANDROID: gki_defconfig: enable CONFIG_REGULATOR_FIXED_VOLTAGE ANDROID: gki_defconfig: removed CONFIG_PM_WAKELOCKS ANDROID: gki_defconfig: enable CONFIG_IKHEADERS as m FROMGIT: pinctrl: devicetree: Avoid taking direct reference to device name string ANDROID: update abi_gki_aarch64.xml for 4.19.86 update ANDROID: Update ABI representation ANDROID: gki_defconfig: disable FUNCTION_TRACER ANDROID: Update the ABI representation ANDROID: update ABI representation ANDROID: add unstripped modules to the distribution FROMLIST: vsprintf: Inline call to ptr_to_hashval UPSTREAM: rss_stat: Add support to detect RSS updates of external mm UPSTREAM: mm: emit tracepoint when RSS changes FROMGIT: driver core: Allow device link operations inside sync_state() ANDROID: uid_sys_stats: avoid double accounting of dying threads ANDROID: scsi: ufs-qcom: Enable BROKEN_CRYPTO quirk flag ANDROID: scsi: ufs-hisi: Enable BROKEN_CRYPTO quirk flag ANDROID: scsi: ufs: Add quirk bit for controllers that don't play well with inline crypto ANDROID: scsi: ufs: UFS init should not require inline crypto ANDROID: scsi: ufs: UFS crypto variant operations API ANDROID: gki_defconfig: enable inline encryption BACKPORT: FROMLIST: ext4: add inline encryption support BACKPORT: FROMLIST: f2fs: add inline encryption support BACKPORT: FROMLIST: fscrypt: add inline encryption support BACKPORT: FROMLIST: scsi: ufs: Add inline encryption support to UFS BACKPORT: FROMLIST: scsi: ufs: UFS crypto API BACKPORT: FROMLIST: scsi: ufs: UFS driver v2.1 spec crypto additions BACKPORT: FROMLIST: block: blk-crypto for Inline Encryption ANDROID: block: Fix bio_crypt_should_process WARN_ON BACKPORT: FROMLIST: block: Add encryption context to struct bio BACKPORT: FROMLIST: block: Keyslot Manager for Inline Encryption FROMLIST: f2fs: add support for IV_INO_LBLK_64 encryption policies FROMLIST: ext4: add support for IV_INO_LBLK_64 encryption policies BACKPORT: FROMLIST: fscrypt: add support for IV_INO_LBLK_64 policies FROMLIST: fscrypt: zeroize fscrypt_info before freeing FROMLIST: fscrypt: remove struct fscrypt_ctx BACKPORT: FROMLIST: fscrypt: invoke crypto API for ESSIV handling ANDROID: build kernels with llvm-nm and llvm-objcopy ANDROID: Fix allmodconfig build with CC=clang UPSTREAM: mm/page_poison: expose page_poisoning_enabled to kernel modules FROMGIT: of: property: Add device link support for iommus, mboxes and io-channels FROMGIT: of: property: Make it easy to add device links from DT properties FROMGIT: of: property: Minor style clean up of of_link_to_phandle() Revert "ANDROID: of/property: Add device link support for iommus" ANDROID: Add allmodconfig build.configs for x86_64 and aarch64 ANDROID: fix allmodconfig build ANDROID: nf: IDLETIMER: Fix possible use before initialization in idletimer_resume BACKPORT: coresight: funnel: Support static funnel BACKPORT:FROMGIT: coresight: replicator: Fix missing spin_lock_init() BACKPORT:FROMGIT: coresight: funnel: Fix missing spin_lock_init() BACKPORT:FROMGIT: coresight: Serialize enabling/disabling a link device. UPSTREAM: coresight: tmc-etr: Add barrier packets when moving offset forward UPSTREAM: coresight: tmc-etr: Decouple buffer sync and barrier packet insertion UPSTREAM: coresight: tmc: Make memory width mask computation into a function UPSTREAM: coresight: tmc-etr: Fix perf_data check UPSTREAM: coresight: tmc-etr: Fix updating buffer in not-snapshot mode. UPSTREAM: coresight: tmc-etr: Check if non-secure access is enabled UPSTREAM: coresight: tmc-etr: Handle memory errors BACKPORT: coresight: etr_buf: Consolidate refcount initialization UPSTREAM: coresight: Fix DEBUG_LOCKS_WARN_ON for uninitialized attribute UPSTREAM: coresight: Use coresight device names for sinks in PMU attribute UPSTREAM: coresight: tmc-etr: alloc_perf_buf: Do not call smp_processor_id from preemptible UPSTREAM: coresight: tmc-etr: Do not call smp_processor_id() from preemptible UPSTREAM: coresight: perf: Don't set the truncated flag in snapshot mode UPSTREAM: coresight: tmc-etf: Fix snapshot mode update function UPSTREAM: coresight: tmc-etr: Properly set AUX buffer head in snapshot mode UPSTREAM: coresight: tmc-etr: Add support for CPU-wide trace scenarios UPSTREAM: coresight: tmc-etr: Allocate and free ETR memory buffers for CPU-wide scenarios UPSTREAM: coresight: tmc-etr: Introduce the notion of IDR to ETR devices UPSTREAM: coresight: tmc-etr: Introduce the notion of reference counting to ETR devices UPSTREAM: coresight: tmc-etr: Introduce the notion of process ID to ETR devices UPSTREAM: coresight: tmc-etr: Create per-thread buffer allocation function UPSTREAM: coresight: tmc-etr: Refactor function tmc_etr_setup_perf_buf() UPSTREAM: coresight: Communicate perf event to sink buffer allocation functions UPSTREAM: coresight: perf: Refactor function free_event_data() UPSTREAM: coresight: perf: Clean up function etm_setup_aux() UPSTREAM: coresight: Properly address concurrency in sink::update() functions UPSTREAM: coresight: Properly address errors in sink::disable() functions UPSTREAM: coresight: Move reference counting inside sink drivers UPSTREAM: coresight: Adding return code to sink::disable() operation UPSTREAM: coresight: etm4x: Configure tracers to emit timestamps UPSTREAM: coresight: etm4x: Skip selector pair 0 UPSTREAM: coresight: etm4x: Add kernel configuration for CONTEXTID UPSTREAM: coresight: pmu: Adding ITRACE property to cs_etm PMU UPSTREAM: coresight: tmc: Cleanup power management UPSTREAM: coresight: Fix freeing up the coresight connections UPSTREAM: coresight: tmc: Report DMA setup failures UPSTREAM: coresight: catu: fix clang build warning UPSTREAM: perf/core: Fix the address filtering fix UPSTREAM: perf, pt, coresight: Fix address filters for vmas with non-zero offset UPSTREAM: perf: Copy parent's address filter offsets on clone UPSTREAM: coresight: Use event attributes for sink selection UPSTREAM: coresight: perf: Add "sinks" group to PMU directory UPSTREAM: coresight: etb10: Add support for CLAIM tag UPSTREAM: coreisght: tmc: Claim device before use UPSTREAM: coresight: dynamic-replicator: Claim device for use UPSTREAM: coresight: funnel: Claim devices before use UPSTREAM: coresight: etmx: Claim devices before use UPSTREAM: coresight: Add support for CLAIM tag protocol UPSTREAM: coresight: dynamic-replicator: Handle multiple connections UPSTREAM: coresight: etb10: Handle errors enabling the device UPSTREAM: coresight: etm3: Add support for handling errors UPSTREAM: coresight: etm4x: Add support for handling errors UPSTREAM: coresight: tmc-etb/etf: Prepare to handle errors enabling UPSTREAM: coresight: tmc-etr: Handle errors enabling CATU UPSTREAM: coresight: tmc-etr: Refactor for handling errors UPSTREAM: coresight: Handle failures in enabling a trace path UPSTREAM: coresight: tmc: Fix byte-address alignment for RRP UPSTREAM: coresight: etm4x: Configure EL2 exception level when kernel is running in HYP UPSTREAM: coresight: etb10: Splitting function etb_enable() UPSTREAM: coresight: etb10: Refactor etb_drvdata::mode handling UPSTREAM: coresight: etm-perf: Add support for ETR backend UPSTREAM: coresight: perf: Remove set_buffer call back UPSTREAM: coresight: perf: Add helper to retrieve sink configuration UPSTREAM: coresight: perf: Remove reset_buffer call back for sinks UPSTREAM: coresight: Convert driver messages to dev_dbg UPSTREAM: coresight: tmc-etr: Relax collection of trace from sysfs mode UPSTREAM: coresight: tmc-etr: Handle driver mode specific ETR buffers UPSTREAM: coresight: perf: Disable trace path upon source error UPSTREAM: coresight: perf: Allow tracing on hotplugged CPUs UPSTREAM: coresight: perf: Avoid unncessary CPU hotplug read lock UPSTREAM: coresight: perf: Fix per cpu path management UPSTREAM: coresight: Fix handling of sinks UPSTREAM: coresight: Use ERR_CAST instead of ERR_PTR UPSTREAM: coresight: Fix remote endpoint parsing UPSTREAM: coresight: platform: Fix leaking device reference UPSTREAM: coresight: platform: Fix refcounting for graph nodes UPSTREAM: coresight: platform: Refactor graph endpoint parsing UPSTREAM: coresight: Document error handling in coresight_register ANDROID: regression introduced override_creds=off ANDROID: overlayfs: internal getxattr operations without sepolicy checking ANDROID: overlayfs: add __get xattr method ANDROID: Add optional __get xattr method paired to __vfs_getxattr UPSTREAM: scsi: ufs: override auto suspend tunables for ufs UPSTREAM: scsi: core: allow auto suspend override by low-level driver FROMGIT: of: property: Skip adding device links to suppliers that aren't devices ANDROID: gki_defconfig: enable CONFIG_KEYBOARD_GPIO UPSTREAM: dm bufio: introduce a global cache replacement UPSTREAM: dm bufio: remove old-style buffer cleanup UPSTREAM: dm bufio: introduce a global queue UPSTREAM: dm bufio: refactor adjust_total_allocated UPSTREAM: dm bufio: call adjust_total_allocated from __link_buffer and __unlink_buffer ANDROID: dummy_cpufreq: Implement get() ANDROID: gki_defconfig: enable CONFIG_CPUSETS ANDROID: virtio: virtio_input: Set the amount of multitouch slots in virtio input rtlwifi: Fix potential overflow on P2P code ANDROID: cpufreq: create dummy cpufreq driver ANDROID: Allow DRM_IOCTL_MODE__DUMB for render clients. Cuttlefish Wifi: Add data ops in virt_wifi driver for scan data simulation ANDROID: of: property: Enable of_devlink by default ANDROID: of: property: Make sure child dependencies don't block probing of parent ANDROID: driver core: Allow fwnode_operations.add_links to differentiate errors ANDROID: driver core: Allow a device to wait on optional suppliers ANDROID: driver core: Add device link support for SYNC_STATE_ONLY flag FROMGIT: docs: driver-model: Add documentation for sync_state FROMGIT: driver: core: Improve documentation for fwnode_operations.add_links() FROMGIT: of: property: Minor code formatting/style clean ups ANDROID: of/property: Add device link support for iommus ANDROID: move up spin_unlock_bh() ahead of remove_proc_entry() BACKPORT: arm64: tags: Preserve tags for addresses translated via TTBR1 UPSTREAM: arm64: memory: Implement __tag_set() as common function UPSTREAM: arm64/mm: fix variable 'tag' set but not used UPSTREAM: arm64: avoid clang warning about self-assignment ANDROID: sdcardfs: evict dentries on fscrypt key removal ANDROID: fscrypt: add key removal notifier chain ANDROID: refactor build.config files to remove duplication ANDROID: Move from clang r353983c to r365631c ANDROID: gki_defconfig: remove PWRSEQ_EMMC and PWRSEQ_SIMPLE ANDROID: unconditionally compile sig_ok in struct module ANDROID: gki_defconfig: enable fs-verity UPSTREAM: mm: vmalloc: show number of vmalloc pages in /proc/meminfo BACKPORT: PM/sleep: Expose suspend stats in sysfs UPSTREAM: power: supply: Init device wakeup after device_add() UPSTREAM: PM / wakeup: Unexport wakeup_source_sysfs_{add,remove}() UPSTREAM: PM / wakeup: Register wakeup class kobj after device is added UPSTREAM: PM / wakeup: Fix sysfs registration error path UPSTREAM: PM / wakeup: Show wakeup sources stats in sysfs UPSTREAM: PM / wakeup: Use wakeup_source_register() in wakelock.c UPSTREAM: PM / wakeup: Drop wakeup_source_init(), wakeup_source_prepare() UPSTREAM: PM / wakeup: Drop wakeup_source_drop() UPSTREAM: PM / core: Add support to skip power management in device/driver model gki_defconfig: Enable CONFIG_DM_SNAPSHOT ANDROID: gki_defconfig: enable accelerated AES and SHA-256 ANDROID: fix overflow in /proc/uid_cputime/remove_uid_range ANDROID: kasan: fix has_attribute check on older GCC versions ANDROID: gki_defconfig: enable CONFIG_PARAVIRT and CONFIG_HYPERVISOR_GUEST ANDROID: gki_defconfig: enable CONFIG_NLS_ ANDROID: gki_defconfig: Enable BPF_JIT and BPF_JIT_ALWAYS_ON FROMGIT: of: property: Create device links for all child-supplier depencencies FROMGIT: of/platform: Pause/resume sync state during init and of_platform_populate() BACKPORT: FROMGIT: driver core: Add sync_state driver/bus callback BACKPORT: FROMGIT: of: property: Add functional dependency link from DT bindings FROMGIT: driver core: Add support for linking devices during device addition FROMGIT: driver core: Add fwnode_to_dev() to look up device from fwnode UPSTREAM: mm: untag user pointers in mmap/munmap/mremap/brk UPSTREAM: vfio/type1: untag user pointers in vaddr_get_pfn UPSTREAM: tee/shm: untag user pointers in tee_shm_register UPSTREAM: media/v4l2-core: untag user pointers in videobuf_dma_contig_user_get UPSTREAM: drm/radeon: untag user pointers in radeon_gem_userptr_ioctl BACKPORT: drm/amdgpu: untag user pointers UPSTREAM: userfaultfd: untag user pointers UPSTREAM: fs/namespace: untag user pointers in copy_mount_options UPSTREAM: mm: untag user pointers in get_vaddr_frames UPSTREAM: mm: untag user pointers in mm/gup.c UPSTREAM: mm: untag user pointers passed to memory syscalls BACKPORT: lib: untag user pointers in strn_user UPSTREAM: arm64: Fix reference to docs for ARM64_TAGGED_ADDR_ABI UPSTREAM: selftests, arm64: add kernel headers path for tags_test BACKPORT: arm64: Relax Documentation/arm64/tagged-pointers.rst UPSTREAM: arm64: Define Documentation/arm64/tagged-address-abi.rst UPSTREAM: arm64: Change the tagged_addr sysctl control semantics to only prevent the opt-in UPSTREAM: arm64: Tighten the PR_{SET, GET}_TAGGED_ADDR_CTRL prctl() unused arguments UPSTREAM: selftests, arm64: fix uninitialized symbol in tags_test.c UPSTREAM: arm64: mm: Really fix sparse warning in untagged_addr() UPSTREAM: selftests, arm64: add a selftest for passing tagged pointers to kernel BACKPORT: arm64: Introduce prctl() options to control the tagged user addresses ABI UPSTREAM: arm64: untag user pointers in access_ok and __uaccess_mask_ptr UPSTREAM: uaccess: add noop untagged_addr definition BACKPORT: block: annotate refault stalls from IO submission f2fs: add a condition to detect overflow in f2fs_ioc_gc_range() f2fs: fix to add missing F2FS_IO_ALIGNED() condition f2fs: fix to fallback to buffered IO in IO aligned mode f2fs: fix to handle error path correctly in f2fs_map_blocks f2fs: fix extent corrupotion during directIO in LFS mode f2fs: check all the data segments against all node ones f2fs: Add a small clarification to CONFIG_FS_F2FS_FS_SECURITY f2fs: fix inode rwsem regression f2fs: fix to avoid accessing uninitialized field of inode page in is_alive() f2fs: avoid infinite GC loop due to stale atomic files f2fs: Fix indefinite loop in f2fs_gc() f2fs: convert inline_data in prior to i_size_write f2fs: fix error path of f2fs_convert_inline_page() f2fs: add missing documents of reserve_root/resuid/resgid f2fs: fix flushing node pages when checkpoint is disabled f2fs: enhance f2fs_is_checkpoint_ready()'s readability f2fs: clean up __bio_alloc()'s parameter f2fs: fix wrong error injection path in inc_valid_block_count() f2fs: fix to writeout dirty inode during node flush f2fs: optimize case-insensitive lookups f2fs: introduce f2fs_match_name() for cleanup f2fs: Fix indefinite loop in f2fs_gc() f2fs: allocate memory in batch in build_sit_info() f2fs: support FS_IOC_{GET,SET}FSLABEL f2fs: fix to avoid data corruption by forbidding SSR overwrite f2fs: Fix build error while CONFIG_NLS=m Revert "f2fs: avoid out-of-range memory access" f2fs: cleanup the code in build_sit_entries. f2fs: fix wrong available node count calculation f2fs: remove duplicate code in f2fs_file_write_iter f2fs: fix to migrate blocks correctly during defragment f2fs: use wrapped f2fs_cp_error() f2fs: fix to use more generic EOPNOTSUPP f2fs: use wrapped IS_SWAPFILE() f2fs: Support case-insensitive file name lookups f2fs: include charset encoding information in the superblock fs: Reserve flag for casefolding f2fs: fix to avoid call kvfree under spinlock fs: f2fs: Remove unnecessary checks of SM_I(sbi) in update_general_status() f2fs: disallow direct IO in atomic write f2fs: fix to handle quota_{on,off} correctly f2fs: fix to detect cp error in f2fs_setxattr() f2fs: fix to spread f2fs_is_checkpoint_ready() f2fs: support fiemap() for directory inode f2fs: fix to avoid discard command leak f2fs: fix to avoid tagging SBI_QUOTA_NEED_REPAIR incorrectly f2fs: fix to drop meta/node pages during umount f2fs: disallow switching io_bits option during remount f2fs: fix panic of IO alignment feature f2fs: introduce {page,io}_is_mergeable() for readability f2fs: fix livelock in swapfile writes f2fs: add fs-verity support ext4: update on-disk format documentation for fs-verity ext4: add fs-verity read support ext4: add basic fs-verity support fs-verity: support builtin file signatures fs-verity: add SHA-512 support fs-verity: implement FS_IOC_MEASURE_VERITY ioctl fs-verity: implement FS_IOC_ENABLE_VERITY ioctl fs-verity: add data verification hooks for ->readpages() fs-verity: add the hook for file ->setattr() fs-verity: add the hook for file ->open() fs-verity: add inode and superblock fields fs-verity: add Kconfig and the helper functions for hashing fs: uapi: define verity bit for FS_IOC_GETFLAGS fs-verity: add UAPI header fs-verity: add MAINTAINERS file entry fs-verity: add a documentation file ext4: fix kernel oops caused by spurious casefold flag ext4: fix coverity warning on error path of filename setup ext4: optimize case-insensitive lookups ext4: fix dcache lookup of !casefolded directories unicode: update to Unicode 12.1.0 final unicode: add missing check for an error return from utf8lookup() ext4: export /sys/fs/ext4/feature/casefold if Unicode support is present unicode: refactor the rule for regenerating utf8data.h ext4: Support case-insensitive file name lookups ext4: include charset encoding information in the superblock unicode: update unicode database unicode version 12.1.0 unicode: introduce test module for normalized utf8 implementation unicode: implement higher level API for string handling unicode: reduce the size of utf8data[] unicode: introduce code for UTF-8 normalization unicode: introduce UTF-8 character database ext4 crypto: fix to check feature status before get policy fscrypt: document the new ioctls and policy version ubifs: wire up new fscrypt ioctls f2fs: wire up new fscrypt ioctls ext4: wire up new fscrypt ioctls fscrypt: require that key be added when setting a v2 encryption policy fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS ioctl fscrypt: allow unprivileged users to add/remove keys for v2 policies fscrypt: v2 encryption policy support fscrypt: add an HKDF-SHA512 implementation fscrypt: add FS_IOC_GET_ENCRYPTION_KEY_STATUS ioctl fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl fscrypt: rename keyinfo.c to keysetup.c fscrypt: move v1 policy key setup to keysetup_v1.c fscrypt: refactor key setup code in preparation for v2 policies fscrypt: rename fscrypt_master_key to fscrypt_direct_key fscrypt: add ->ci_inode to fscrypt_info fscrypt: use FSCRYPT_ definitions, not FS_* fscrypt: use FSCRYPT_ prefix for uapi constants fs, fscrypt: move uapi definitions to new header <linux/fscrypt.h> fscrypt: use ENOPKG when crypto API support missing fscrypt: improve warnings for missing crypto API support fscrypt: improve warning messages for unsupported encryption contexts fscrypt: make fscrypt_msg() take inode instead of super_block fscrypt: clean up base64 encoding/decoding fscrypt: remove loadable module related code Updated following files to fix build errors: drivers/gpu/msm/kgsl_pool.c drivers/hwtracing/coresight/coresight-dummy.c drivers/iommu/dma-mapping-fast.c drivers/iommu/io-pgtable-fast.c drivers/iommu/io-pgtable-msm-secure.c kernel/taskstats.c mm/vmalloc.c security/selinux/ss/sidtab.h Conflicts: arch/arm/Makefile arch/arm64/Kconfig arch/x86/include/asm/syscall_wrapper.h build.config.common drivers/clk/clk.c drivers/hwtracing/coresight/coresight-etm-perf.c drivers/hwtracing/coresight/coresight-funnel.c drivers/hwtracing/coresight/coresight-tmc-etf.c drivers/hwtracing/coresight/coresight-tmc-etr.c drivers/hwtracing/coresight/coresight-tmc.c drivers/hwtracing/coresight/coresight-tmc.h drivers/hwtracing/coresight/coresight.c drivers/hwtracing/coresight/of_coresight.c drivers/iommu/arm-smmu.c drivers/iommu/io-pgtable-arm.c drivers/iommu/io-pgtable.c drivers/scsi/scsi_sysfs.c drivers/scsi/sd.c drivers/scsi/ufs/ufshcd.c drivers/scsi/ufs/ufshcd.h drivers/staging/android/ion/ion.c drivers/staging/android/ion/ion.h drivers/staging/android/ion/ion_page_pool.c fs/ext4/readpage.c fs/f2fs/data.c fs/f2fs/f2fs.h fs/f2fs/file.c fs/f2fs/segment.c fs/f2fs/super.c include/linux/clk-provider.h include/linux/compiler_types.h include/linux/coresight.h include/linux/mmzone.h include/scsi/scsi_device.h include/trace/events/kmem.h kernel/events/core.c kernel/sched/core.c mm/vmstat.c Change-Id: I2eca52b08b484f2b5c30437671cab8cb0195b8d6 Signed-off-by: Ivaylo Georgiev <irgeorgiev@codeaurora.org>	2020-03-27 10:48:20 -07:00
Andrey Konovalov	690c4ca8a5	UPSTREAM: mm: untag user pointers passed to memory syscalls (Upstream commit 057d3389108eda8a20c7f496f011846932680d88). This patch is a part of a series that extends kernel ABI to allow to pass tagged user pointers (with the top byte set to something else other than 0x00) as syscall arguments. This patch allows tagged pointers to be passed to the following memory syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect, mremap, msync, munlock, move_pages. The mmap and mremap syscalls do not currently accept tagged addresses. Architectures may interpret the tag as a background colour for the corresponding vma. Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jens Wiklander <jens.wiklander@linaro.org> Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Bug: 135692346 Change-Id: I1a2d89eedb45e618e85ca515f4c9121460711efb	2019-10-07 15:27:40 -04:00
Ivaylo Georgiev	053efa6d4c	Merge android-4.19.58 (`5ad6eeb`) into msm-4.19 * refs/heads/tmp-5ad6eeb: Linux 4.19.58 dmaengine: imx-sdma: remove BD_INTR for channel0 dmaengine: qcom: bam_dma: Fix completed descriptors count MIPS: have "plain" make calls build dtbs for selected platforms MIPS: Add missing EHB in mtc0 -> mfc0 sequence. MIPS: Fix bounds check virt_addr_valid svcrdma: Ignore source port when computing DRC hash nfsd: Fix overflow causing non-working mounts on 1 TB machines KVM: LAPIC: Fix pending interrupt in IRR blocked by software disable LAPIC KVM: x86: degrade WARN to pr_warn_ratelimited netfilter: ipv6: nf_defrag: accept duplicate fragments again bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K net: hns: fix unsigned comparison to less than zero sc16is7xx: move label 'err_spi' to correct section netfilter: ipv6: nf_defrag: fix leakage of unqueued fragments ip6: fix skb leak in ip6frag_expire_frag_queue() rds: Fix warning. ALSA: hda: Initialize power_state field properly net: hns: Fixes the missing put_device in positive leg for roce reset x86/boot/compressed/64: Do not corrupt EDX on EFER.LME=1 setting selftests: fib_rule_tests: Fix icmp proto with ipv6 scsi: tcmu: fix use after free mac80211: mesh: fix missing unlock on error in table_path_del() f2fs: don't access node/meta inode mapping after iput drm/fb-helper: generic: Don't take module ref for fbcon media: s5p-mfc: fix incorrect bus assignment in virtual child device net/smc: move unhash before release of clcsock mlxsw: spectrum: Handle VLAN device unlinking tty: rocket: fix incorrect forward declaration of 'rp_init()' btrfs: Ensure replaced device doesn't have pending chunk allocation mm/vmscan.c: prevent useless kswapd loops ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code() drm/imx: only send event on crtc disable if kept disabled drm/imx: notify drm core before sending event during crtc disable drm/etnaviv: add missing failure path to destroy suballoc drm/amdgpu/gfx9: use reset default for PA_SC_FIFO_SIZE drm/amd/powerplay: use hardware fan control if no powerplay fan table arm64: kaslr: keep modules inside module region when KASAN is enabled ARM: dts: armada-xp-98dx3236: Switch to armada-38x-uart serial node tracing/snapshot: Resize spare buffer if size changed fs/userfaultfd.c: disable irqs for fault_pending and event locks lib/mpi: Fix karactx leak in mpi_powm ALSA: hda/realtek - Change front mic location for Lenovo M710q ALSA: hda/realtek: Add quirks for several Clevo notebook barebones ALSA: usb-audio: fix sign unintended sign extension on left shifts ALSA: line6: Fix write on zero-sized buffer ALSA: firewire-lib/fireworks: fix miss detection of received MIDI messages ALSA: seq: fix incorrect order of dest_client/dest_ports arguments crypto: cryptd - Fix skcipher instance memory leak crypto: user - prevent operating on larval algorithms ptrace: Fix ->ptracer_cred handling for PTRACE_TRACEME drm/i915/dmc: protect against reading random memory ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper() module: Fix livepatch/ftrace module text permissions race tracing: avoid build warning with HAVE_NOP_MCOUNT mm/mlock.c: change count_mm_mlocked_page_nr return type scripts/decode_stacktrace.sh: prefix addr2line with $CROSS_COMPILE cpuset: restore sanity to cpuset_cpus_allowed_fallback() i2c: pca-platform: Fix GPIO lookup code platform/mellanox: mlxreg-hotplug: Add devm_free_irq call to remove flow platform/x86: mlx-platform: Fix parent device in i2c-mux-reg device registration platform/x86: intel-vbtn: Report switch events when event wakes device platform/x86: asus-wmi: Only Tell EC the OS will handle display hotkeys from asus_nb_wmi drm: panel-orientation-quirks: Add quirk for GPD MicroPC drm: panel-orientation-quirks: Add quirk for GPD pocket2 scsi: hpsa: correct ioaccel2 chaining SoC: rt274: Fix internal jack assignment in set_jack callback ALSA: hdac: fix memory release for SST and SOF drivers usb: gadget: udc: lpc32xx: allocate descriptor with GFP_ATOMIC usb: gadget: fusb300_udc: Fix memory leak of fusb300->ep[i] x86/CPU: Add more Icelake model numbers ASoC: sun4i-i2s: Add offset to RX channel select ASoC: sun4i-i2s: Fix sun8i tx channel offset mask ASoC: max98090: remove 24-bit format support if RJ is 0 drm/mediatek: call mtk_dsi_stop() after mtk_drm_crtc_atomic_disable() drm/mediatek: clear num_pipes when unbind driver drm/mediatek: call drm_atomic_helper_shutdown() when unbinding driver drm/mediatek: unbind components in mtk_drm_unbind() drm/mediatek: fix unbind functions spi: bitbang: Fix NULL pointer dereference in spi_unregister_master ASoC: ak4458: rstn_control - return a non-zero on error only ASoC: soc-pcm: BE dai needs prepare when pause release after resume ASoC: ak4458: add return value for ak4458_probe ASoC : cs4265 : readable register too low netfilter: nft_flow_offload: IPCB is only valid for ipv4 family netfilter: nft_flow_offload: don't offload when sequence numbers need adjustment netfilter: nft_flow_offload: set liberal tracking mode for tcp netfilter: nf_flow_table: ignore DF bit setting md/raid0: Do not bypass blocking queue entered for raid0 bios block: Fix a NULL pointer dereference in generic_make_request() Bluetooth: Fix faulty expression for minimum encryption key size check Change-Id: I42f3ee04258a2392be42269d05d99dc0b02a9feb Signed-off-by: Ivaylo Georgiev <irgeorgiev@codeaurora.org>	2019-08-13 01:19:52 -07:00
Greg Kroah-Hartman	5ad6eeba58	Merge 4.19.58 into android-4.19 Changes in 4.19.58 Bluetooth: Fix faulty expression for minimum encryption key size check block: Fix a NULL pointer dereference in generic_make_request() md/raid0: Do not bypass blocking queue entered for raid0 bios netfilter: nf_flow_table: ignore DF bit setting netfilter: nft_flow_offload: set liberal tracking mode for tcp netfilter: nft_flow_offload: don't offload when sequence numbers need adjustment netfilter: nft_flow_offload: IPCB is only valid for ipv4 family ASoC : cs4265 : readable register too low ASoC: ak4458: add return value for ak4458_probe ASoC: soc-pcm: BE dai needs prepare when pause release after resume ASoC: ak4458: rstn_control - return a non-zero on error only spi: bitbang: Fix NULL pointer dereference in spi_unregister_master drm/mediatek: fix unbind functions drm/mediatek: unbind components in mtk_drm_unbind() drm/mediatek: call drm_atomic_helper_shutdown() when unbinding driver drm/mediatek: clear num_pipes when unbind driver drm/mediatek: call mtk_dsi_stop() after mtk_drm_crtc_atomic_disable() ASoC: max98090: remove 24-bit format support if RJ is 0 ASoC: sun4i-i2s: Fix sun8i tx channel offset mask ASoC: sun4i-i2s: Add offset to RX channel select x86/CPU: Add more Icelake model numbers usb: gadget: fusb300_udc: Fix memory leak of fusb300->ep[i] usb: gadget: udc: lpc32xx: allocate descriptor with GFP_ATOMIC ALSA: hdac: fix memory release for SST and SOF drivers SoC: rt274: Fix internal jack assignment in set_jack callback scsi: hpsa: correct ioaccel2 chaining drm: panel-orientation-quirks: Add quirk for GPD pocket2 drm: panel-orientation-quirks: Add quirk for GPD MicroPC platform/x86: asus-wmi: Only Tell EC the OS will handle display hotkeys from asus_nb_wmi platform/x86: intel-vbtn: Report switch events when event wakes device platform/x86: mlx-platform: Fix parent device in i2c-mux-reg device registration platform/mellanox: mlxreg-hotplug: Add devm_free_irq call to remove flow i2c: pca-platform: Fix GPIO lookup code cpuset: restore sanity to cpuset_cpus_allowed_fallback() scripts/decode_stacktrace.sh: prefix addr2line with $CROSS_COMPILE mm/mlock.c: change count_mm_mlocked_page_nr return type tracing: avoid build warning with HAVE_NOP_MCOUNT module: Fix livepatch/ftrace module text permissions race ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper() drm/i915/dmc: protect against reading random memory ptrace: Fix ->ptracer_cred handling for PTRACE_TRACEME crypto: user - prevent operating on larval algorithms crypto: cryptd - Fix skcipher instance memory leak ALSA: seq: fix incorrect order of dest_client/dest_ports arguments ALSA: firewire-lib/fireworks: fix miss detection of received MIDI messages ALSA: line6: Fix write on zero-sized buffer ALSA: usb-audio: fix sign unintended sign extension on left shifts ALSA: hda/realtek: Add quirks for several Clevo notebook barebones ALSA: hda/realtek - Change front mic location for Lenovo M710q lib/mpi: Fix karactx leak in mpi_powm fs/userfaultfd.c: disable irqs for fault_pending and event locks tracing/snapshot: Resize spare buffer if size changed ARM: dts: armada-xp-98dx3236: Switch to armada-38x-uart serial node arm64: kaslr: keep modules inside module region when KASAN is enabled drm/amd/powerplay: use hardware fan control if no powerplay fan table drm/amdgpu/gfx9: use reset default for PA_SC_FIFO_SIZE drm/etnaviv: add missing failure path to destroy suballoc drm/imx: notify drm core before sending event during crtc disable drm/imx: only send event on crtc disable if kept disabled ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code() mm/vmscan.c: prevent useless kswapd loops btrfs: Ensure replaced device doesn't have pending chunk allocation tty: rocket: fix incorrect forward declaration of 'rp_init()' mlxsw: spectrum: Handle VLAN device unlinking net/smc: move unhash before release of clcsock media: s5p-mfc: fix incorrect bus assignment in virtual child device drm/fb-helper: generic: Don't take module ref for fbcon f2fs: don't access node/meta inode mapping after iput mac80211: mesh: fix missing unlock on error in table_path_del() scsi: tcmu: fix use after free selftests: fib_rule_tests: Fix icmp proto with ipv6 x86/boot/compressed/64: Do not corrupt EDX on EFER.LME=1 setting net: hns: Fixes the missing put_device in positive leg for roce reset ALSA: hda: Initialize power_state field properly rds: Fix warning. ip6: fix skb leak in ip6frag_expire_frag_queue() netfilter: ipv6: nf_defrag: fix leakage of unqueued fragments sc16is7xx: move label 'err_spi' to correct section net: hns: fix unsigned comparison to less than zero bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K netfilter: ipv6: nf_defrag: accept duplicate fragments again KVM: x86: degrade WARN to pr_warn_ratelimited KVM: LAPIC: Fix pending interrupt in IRR blocked by software disable LAPIC nfsd: Fix overflow causing non-working mounts on 1 TB machines svcrdma: Ignore source port when computing DRC hash MIPS: Fix bounds check virt_addr_valid MIPS: Add missing EHB in mtc0 -> mfc0 sequence. MIPS: have "plain" make calls build dtbs for selected platforms dmaengine: qcom: bam_dma: Fix completed descriptors count dmaengine: imx-sdma: remove BD_INTR for channel0 Linux 4.19.58 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-07-10 11:40:00 +02:00
swkhack	79fccb9815	mm/mlock.c: change count_mm_mlocked_page_nr return type [ Upstream commit 0874bb49bb21bf24deda853e8bf61b8325e24bcb ] On a 64-bit machine the value of "vma->vm_end - vma->vm_start" may be negative when using 32 bit ints and the "count >> PAGE_SHIFT"'s result will be wrong. So change the local variable and return value to unsigned long to fix the problem. Link: http://lkml.kernel.org/r/20190513023701.83056-1-swkhack@gmail.com Fixes: `0cf2f6f6dc` ("mm: mlock: check against vma for actual mlock() size") Signed-off-by: swkhack <swkhack@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-10 09:53:40 +02:00
Laurent Dufour	3cfc37dc2b	mm: protect VMA modifications using VMA sequence count The VMA sequence count has been introduced to allow fast detection of VMA modification when running a page fault handler without holding the mmap_sem. This patch provides protection against the VMA modification done in : - madvise() - mpol_rebind_policy() - vma_replace_policy() - change_prot_numa() - mlock(), munlock() - mprotect() - mmap_region() - collapse_huge_page() - userfaultd registering services In addition, VMA fields which will be read during the speculative fault path needs to be written using WRITE_ONCE to prevent write to be split and intermediate values to be pushed to other CPUs. Change-Id: Ic36046b7254e538b6baf7144c50ae577ee7f2074 Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:15 [vinmenon@codeaurora.org: trivial merge conflict fixes] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> [charante@codeaurora.org: trivial merge conflict fixes] Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>	2019-03-29 03:08:43 -07:00
Colin Cross	533e4ed309	ANDROID: mm: add a field to store names for private anonymous memory Userspace processes often have multiple allocators that each do anonymous mmaps to get memory. When examining memory usage of individual processes or systems as a whole, it is useful to be able to break down the various heaps that were allocated by each layer and examine their size, RSS, and physical memory usage. This patch adds a user pointer to the shared union in vm_area_struct that points to a null terminated string inside the user process containing a name for the vma. vmas that point to the same address will be merged, but vmas that point to equivalent strings at different addresses will not be merged. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name); Setting the name to NULL clears it. The names of named anonymous vmas are shown in /proc/pid/maps as [anon:<name>] and in /proc/pid/smaps in a new "Name" field that is only present for named vmas. If the userspace pointer is no longer valid all or part of the name will be replaced with "<fault>". The idea to store a userspace pointer to reduce the complexity within mm (at the expense of the complexity of reading /proc/pid/mem) came from Dave Hansen. This results in no runtime overhead in the mm subsystem other than comparing the anon_name pointers when considering vma merging. The pointer is stored in a union with fieds that are only used on file-backed mappings, so it does not increase memory usage. Includes fix from Jed Davis <jld@mozilla.com> for typo in prctl_set_vma_anon_name, which could attempt to set the name across two vmas at the same time due to a typo, which might corrupt the vma list. Fix it to use tmp instead of end to limit the name setting to a single vma at a time. Bug: 120441514 Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f Signed-off-by: Dmitry Shmidt <dimitrysh@google.com> [AmitP: Fix get_user_pages_remote() call to align with upstream commit `5b56d49fc3` ("mm: add locked parameter to get_user_pages_remote()")] Signed-off-by: Amit Pundir <amit.pundir@linaro.org>	2018-12-05 09:48:11 -08:00
Colin Cross	62dd4637d6	ANDROID: mm: add a field to store names for private anonymous memory Userspace processes often have multiple allocators that each do anonymous mmaps to get memory. When examining memory usage of individual processes or systems as a whole, it is useful to be able to break down the various heaps that were allocated by each layer and examine their size, RSS, and physical memory usage. This patch adds a user pointer to the shared union in vm_area_struct that points to a null terminated string inside the user process containing a name for the vma. vmas that point to the same address will be merged, but vmas that point to equivalent strings at different addresses will not be merged. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name); Setting the name to NULL clears it. The names of named anonymous vmas are shown in /proc/pid/maps as [anon:<name>] and in /proc/pid/smaps in a new "Name" field that is only present for named vmas. If the userspace pointer is no longer valid all or part of the name will be replaced with "<fault>". The idea to store a userspace pointer to reduce the complexity within mm (at the expense of the complexity of reading /proc/pid/mem) came from Dave Hansen. This results in no runtime overhead in the mm subsystem other than comparing the anon_name pointers when considering vma merging. The pointer is stored in a union with fieds that are only used on file-backed mappings, so it does not increase memory usage. Includes fix from Jed Davis <jld@mozilla.com> for typo in prctl_set_vma_anon_name, which could attempt to set the name across two vmas at the same time due to a typo, which might corrupt the vma list. Fix it to use tmp instead of end to limit the name setting to a single vma at a time. Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f Signed-off-by: Dmitry Shmidt <dimitrysh@google.com> [AmitP: Fix get_user_pages_remote() call to align with upstream commit `5b56d49fc3` ("mm: add locked parameter to get_user_pages_remote()")] Signed-off-by: Amit Pundir <amit.pundir@linaro.org>	2018-08-28 17:10:42 +05:30
Dave Jiang	e1fb4a0864	dax: remove VM_MIXEDMAP for fsdax and device dax This patch is reworked from an earlier patch that Dan has posted: https://patchwork.kernel.org/patch/10131727/ VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that the memory page it is dealing with is not typical memory from the linear map. The get_user_pages_fast() path, since it does not resolve the vma, is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we use that as a VM_MIXEDMAP replacement in some locations. In the cases where there is no pte to consult we fallback to using vma_is_dax() to detect the VM_MIXEDMAP special case. Now that we have explicit driver pfn_t-flag opt-in/opt-out for get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This also means we no longer need to worry about safely manipulating vm_flags in a future where we support dynamically changing the dax mode of a file. DAX should also now be supported with madvise_behavior(), vma_merge(), and copy_page_range(). This patch has been tested against ndctl unit test. It has also been tested against xfstests commit: 625515d using fake pmem created by memmap and no additional issues have been observed. Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-08-17 16:20:27 -07:00
Shakeel Butt	9c4e6b1a70	mm, mlock, vmscan: no more skipping pagevecs When a thread mlocks an address space backed either by file pages which are currently not present in memory or swapped out anon pages (not in swapcache), a new page is allocated and added to the local pagevec (lru_add_pvec), I/O is triggered and the thread then sleeps on the page. On I/O completion, the thread can wake on a different CPU, the mlock syscall will then sets the PageMlocked() bit of the page but will not be able to put that page in unevictable LRU as the page is on the pagevec of a different CPU. Even on drain, that page will go to evictable LRU because the PageMlocked() bit is not checked on pagevec drain. The page will eventually go to right LRU on reclaim but the LRU stats will remain skewed for a long time. This patch puts all the pages, even unevictable, to the pagevecs and on the drain, the pages will be added on their LRUs correctly by checking their evictability. This resolves the mlocked pages on pagevec of other CPUs issue because when those pagevecs will be drained, the mlocked file pages will go to unevictable LRU. Also this makes the race with munlock easier to resolve because the pagevec drains happen in LRU lock. However there is still one place which makes a page evictable and does PageLRU check on that page without LRU lock and needs special attention. TestClearPageMlocked() and isolate_lru_page() in clear_page_mlock(). #0: __pagevec_lru_add_fn #1: clear_page_mlock SetPageLRU() if (!TestClearPageMlocked()) return smp_mb() // <--required // inside does PageLRU if (!PageMlocked()) if (isolate_lru_page()) move to evictable LRU putback_lru_page() else move to unevictable LRU In '#1', TestClearPageMlocked() provides full memory barrier semantics and thus the PageLRU check (inside isolate_lru_page) can not be reordered before it. In '#0', without explicit memory barrier, the PageMlocked() check can be reordered before SetPageLRU(). If that happens, '#0' can put a page in unevictable LRU and '#1' might have just cleared the Mlocked bit of that page but fails to isolate as PageLRU fails as '#0' still hasn't set PageLRU bit of that page. That page will be stranded on the unevictable LRU. There is one (good) side effect though. Without this patch, the pages allocated for System V shared memory segment are added to evictable LRUs even after shmctl(SHM_LOCK) on that segment. This patch will correctly put such pages to unevictable LRU. Link: http://lkml.kernel.org/r/20171121211241.18877-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Jan Kara <jack@suse.cz> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-02-21 15:35:42 -08:00
Mike Rapoport	b7701a5f2e	mm: docs: fixup punctuation so that kernel-doc will properly recognize the parameter and function descriptions. Link: http://lkml.kernel.org/r/1516700871-22279-2-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-02-06 18:32:48 -08:00
Paul E. McKenney	50d4fb7812	mm: Eliminate cond_resched_rcu_qs() in favor of cond_resched() Now that cond_resched() also provides RCU quiescent states when needed, it can be used in place of cond_resched_rcu_qs(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz>	2017-11-28 16:00:28 -08:00
Shakeel Butt	72b03fcd5d	mm: mlock: remove lru_add_drain_all() lru_add_drain_all() is not required by mlock() and it will drain everything that has been cached at the time mlock is called. And that is not really related to the memory which will be faulted in (and cached) and mlocked by the syscall itself. If anything lru_add_drain_all() should be called _after_ pages have been mlocked and faulted in but even that is not strictly needed because those pages would get to the appropriate LRUs lazily during the reclaim path. Moreover follow_page_pte (gup) will drain the local pcp LRU cache. On larger machines the overhead of lru_add_drain_all() in mlock() can be significant when mlocking data already in memory. We have observed high latency in mlock() due to lru_add_drain_all() when the users were mlocking in memory tmpfs files. [mhocko@suse.com: changelog fix] Link: http://lkml.kernel.org/r/20171019222507.2894-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:07 -08:00
Mel Gorman	8667982014	mm, pagevec: remove cold parameter for pagevecs Every pagevec_init user claims the pages being released are hot even in cases where it is unlikely the pages are hot. As no one cares about the hotness of pages being released to the allocator, just ditch the parameter. No performance impact is expected as the overhead is marginal. The parameter is removed simply because it is a bit stupid to have a useless parameter copied everywhere. Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-11-15 18:21:06 -08:00
Greg Kroah-Hartman	b24413180f	License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a /uapi/ one with no licensing information in it, - file was a /uapi/ one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non /uapi/ files that summary was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a /uapi/ path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the /uapi/ ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------\|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-11-02 11:10:55 +01:00
Joonsoo Kim	9472f23c9e	mm/mlock.c: use page_zone() instead of page_zone_id() page_zone_id() is a specialized function to compare the zone for the pages that are within the section range. If the section of the pages are different, page_zone_id() can be different even if their zone is the same. This wrong usage doesn't cause any actual problem since __munlock_pagevec_fill() would be called again with failed index. However, it's better to use more appropriate function here. Link: http://lkml.kernel.org/r/1503559211-10259-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-08 18:26:47 -07:00
Yisheng Xie	70feee0e1e	mlock: fix mlock count can not decrease in race condition Kefeng reported that when running the follow test, the mlock count in meminfo will increase permanently: [1] testcase linux:~ # cat test_mlockal grep Mlocked /proc/meminfo for j in `seq 0 10` do for i in `seq 4 15` do ./p_mlockall >> log & done sleep 0.2 done # wait some time to let mlock counter decrease and 5s may not enough sleep 5 grep Mlocked /proc/meminfo linux:~ # cat p_mlockall.c #include <sys/mman.h> #include <stdlib.h> #include <stdio.h> #define SPACE_LEN 4096 int main(int argc, char ** argv) { int ret; void *adr = malloc(SPACE_LEN); if (!adr) return -1; ret = mlockall(MCL_CURRENT \| MCL_FUTURE); printf("mlcokall ret = %d\n", ret); ret = munlockall(); printf("munlcokall ret = %d\n", ret); free(adr); return 0; } In __munlock_pagevec() we should decrement NR_MLOCK for each page where we clear the PageMlocked flag. Commit `1ebb7cc6a5` ("mm: munlock: batch NR_MLOCK zone state updates") has introduced a bug where we don't decrement NR_MLOCK for pages where we clear the flag, but fail to isolate them from the lru list (e.g. when the pages are on some other cpu's percpu pagevec). Since PageMlocked stays cleared, the NR_MLOCK accounting gets permanently disrupted by this. Fix it by counting the number of page whose PageMlock flag is cleared. Fixes: `1ebb7cc6a5` (" mm: munlock: batch NR_MLOCK zone state updates") Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joern Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-06-02 15:07:38 -07:00
Minchan Kim	192d723256	mm: make try_to_munlock() return void try_to_munlock returns SWAP_MLOCK if the one of VMAs mapped the page has VM_LOCKED flag. In that time, VM set PG_mlocked to the page if the page is not pte-mapped THP which cannot be mlocked, either. With that, __munlock_isolated_page can use PageMlocked to check whether try_to_munlock is successful or not without relying on try_to_munlock's retval. It helps to make try_to_unmap/try_to_unmap_one simple with upcoming patches. [minchan@kernel.org: remove PG_Mlocked VM_BUG_ON check] Link: http://lkml.kernel.org/r/20170411025615.GA6545@bbox Link: http://lkml.kernel.org/r/1489555493-14659-5-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:10 -07:00
Linus Torvalds	baeedc7158	Merge branch 'prep-for-5level' Merge 5-level page table prep from Kirill Shutemov: "Here's relatively low-risk part of 5-level paging patchset. Merging it now will make x86 5-level paging enabling in v4.12 easier. The first patch is actually x86-specific: detect 5-level paging support. It boils down to single define. The rest of patchset converts Linux MMU abstraction from 4- to 5-level paging. Enabling of new abstraction in most cases requires adding single line of code in arch-specific code. The rest is taken care by asm-generic/. Changes to mm/ code are mostly mechanical: add support for new page table level -- p4d_t -- where we deal with pud_t now. v2: - fix build on microblaze (Michal); - comment for __ARCH_HAS_5LEVEL_HACK in kasan_populate_zero_shadow(); - acks from Michal" * emailed patches from Kirill A Shutemov <kirill.shutemov@linux.intel.com>: mm: introduce __p4d_alloc() mm: convert generic code to 5-level paging asm-generic: introduce <asm-generic/pgtable-nop4d.h> arch, mm: convert all architectures to use 5level-fixup.h asm-generic: introduce __ARCH_USE_5LEVEL_HACK asm-generic: introduce 5level-fixup.h x86/cpufeature: Add 5-level paging detection	2017-03-10 08:59:07 -08:00
Kirill A. Shutemov	6ebb4a1b84	thp: fix another corner case of munlock() vs. THPs The following test case triggers BUG() in munlock_vma_pages_range(): int main(int argc, char *argv[]) { int fd; system("mount -t tmpfs -o huge=always none /mnt"); fd = open("/mnt/test", O_CREAT \| O_RDWR); ftruncate(fd, 4UL << 20); mmap(NULL, 4UL << 20, PROT_READ \| PROT_WRITE, MAP_SHARED \| MAP_FIXED \| MAP_LOCKED, fd, 0); mmap(NULL, 4096, PROT_READ \| PROT_WRITE, MAP_SHARED \| MAP_LOCKED, fd, 0); munlockall(); return 0; } The second mmap() create PTE-mapping of the first huge page in file. It makes kernel munlock the page as we never keep PTE-mapped page mlocked. On munlockall() when we handle vma created by the first mmap(), munlock_vma_page() returns page_mask == 0, as the page is not mlocked anymore. On next iteration follow_page_mask() return tail page, but page_mask is HPAGE_NR_PAGES - 1. It makes us skip to the first tail page of the next huge page and step on VM_BUG_ON_PAGE(PageMlocked(page)). The fix is not use the page_mask from follow_page_mask() at all. It has no use for us. Link: http://lkml.kernel.org/r/20170302150252.34120-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-03-09 17:01:10 -08:00
Kirill A. Shutemov	c2febafc67	mm: convert generic code to 5-level paging Convert all non-architecture-specific code to 5-level paging. It's mostly mechanical adding handling one more page table level in places where we deal with pud_t. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-03-09 11:48:47 -08:00
Ingo Molnar	8703e8a465	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/user.h> We are going to split <linux/sched/user.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/user.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:29 +01:00
Kirill A. Shutemov	655548bf62	thp: fix corner case of munlock() of PTE-mapped THPs The following program triggers BUG() in munlock_vma_pages_range(): // autogenerated by syzkaller (http://github.com/google/syzkaller) #include <sys/mman.h> int main() { mmap((void)0x20105000ul, 0xc00000ul, 0x2ul, 0x2172ul, -1, 0); mremap((void)0x201fd000ul, 0x4000ul, 0xc00000ul, 0x3ul, 0x203f0000ul); return 0; } The test-case constructs the situation when munlock_vma_pages_range() finds PTE-mapped THP-head in the middle of page table and, by mistake, skips HPAGE_PMD_NR pages after that. As result, on the next iteration it hits the middle of PMD-mapped THP and gets upset seeing mlocked tail page. The solution is only skip HPAGE_PMD_NR pages if the THP was mlocked during munlock_vma_page(). It would guarantee that the page is PMD-mapped as we never mlock PTE-mapeed THPs. Fixes: `e90309c9f7` ("thp: allow mlocked THP again") Link: http://lkml.kernel.org/r/20161115132703.7s7rrgmwttegcdh4@black.fi.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: syzkaller <syzkaller@googlegroups.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-11-30 16:32:52 -08:00
Simon Guo	b155b4fde5	mm: mlock: avoid increase mm->locked_vm on mlock() when already mlock2(,MLOCK_ONFAULT) When one vma was with flag VM_LOCKED\|VM_LOCKONFAULT (by invoking mlock2(,MLOCK_ONFAULT)), it can again be populated with mlock() with VM_LOCKED flag only. There is a hole in mlock_fixup() which increase mm->locked_vm twice even the two operations are on the same vma and both with VM_LOCKED flags. The issue can be reproduced by following code: mlock2(p, 1024 * 64, MLOCK_ONFAULT); //VM_LOCKED\|VM_LOCKONFAULT mlock(p, 1024 * 64); //VM_LOCKED Then check the increase VmLck field in /proc/pid/status(to 128k). When vma is set with different vm_flags, and the new vm_flags is with VM_LOCKED, it is not necessarily be a "new locked" vma. This patch corrects this bug by prevent mm->locked_vm from increment when old vm_flags is already VM_LOCKED. Link: http://lkml.kernel.org/r/1472554781-9835-3-git-send-email-wei.guo.simon@gmail.com Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alexey Klimov <klimov.linux@gmail.com> Cc: Eric B Munson <emunson@akamai.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Simon Guo <wei.guo.simon@gmail.com> Cc: Thierry Reding <treding@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:28 -07:00
Simon Guo	0cf2f6f6dc	mm: mlock: check against vma for actual mlock() size In do_mlock(), the check against locked memory limitation has a hole which will fail following cases at step 3): 1) User has a memory chunk from addressA with 50k, and user mem lock rlimit is 64k. 2) mlock(addressA, 30k) 3) mlock(addressA, 40k) The 3rd step should have been allowed since the 40k request is intersected with the previous 30k at step 2), and the 3rd step is actually for mlock on the extra 10k memory. This patch checks vma to caculate the actual "new" mlock size, if necessary, and ajust the logic to fix this issue. [akpm@linux-foundation.org: clean up comment layout] [wei.guo.simon@gmail.com: correct a typo in count_mm_mlocked_page_nr()] Link: http://lkml.kernel.org/r/1473325970-11393-2-git-send-email-wei.guo.simon@gmail.com Link: http://lkml.kernel.org/r/1472554781-9835-2-git-send-email-wei.guo.simon@gmail.com Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Cc: Alexey Klimov <klimov.linux@gmail.com> Cc: Eric B Munson <emunson@akamai.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Simon Guo <wei.guo.simon@gmail.com> Cc: Thierry Reding <treding@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:28 -07:00
Mel Gorman	599d0c954f	mm, vmscan: move LRU lists to node This moves the LRU lists from the zone to the node and related data such as counters, tracing, congestion tracking and writeback tracking. Unfortunately, due to reclaim and compaction retry logic, it is necessary to account for the number of LRU pages on both zone and node logic. Most reclaim logic is based on the node counters but the retry logic uses the zone counters which do not distinguish inactive and active sizes. It would be possible to leave the LRU counters on a per-zone basis but it's a heavier calculation across multiple cache lines that is much more frequent than the retry checks. Other than the LRU counters, this is mostly a mechanical patch but note that it introduces a number of anomalies. For example, the scans are per-zone but using per-node counters. We also mark a node as congested when a zone is congested. This causes weird problems that are fixed later but is easier to review. In the event that there is excessive overhead on 32-bit systems due to the nodes being on LRU then there are two potential solutions 1. Long-term isolation of highmem pages when reclaim is lowmem When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. It potentially could be implemented similar to the UNEVICTABLE list. That would reduce the skip rate with the potential corner case is that highmem pages have to be scanned and reclaimed to free lowmem slab pages. 2. Linear scan lowmem pages if the initial LRU shrink fails This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages. Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Mel Gorman	a52633d8e9	mm, vmscan: move lru_lock to the node Node-based reclaim requires node-based LRUs and locking. This is a preparation patch that just moves the lru_lock to the node so later patches are easier to review. It is a mechanical change but note this patch makes contention worse because the LRU lock is hotter and direct reclaim and kswapd can contend on the same lock even when reclaiming from different zones. Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Michal Hocko	dc0ef0df7b	mm: make mmap_sem for write waits killable for mm syscalls This is a follow up work for oom_reaper [1]. As the async OOM killing depends on oom_sem for read we would really appreciate if a holder for write didn't stood in the way. This patchset is changing many of down_write calls to be killable to help those cases when the writer is blocked and waiting for readers to release the lock and so help __oom_reap_task to process the oom victim. Most of the patches are really trivial because the lock is help from a shallow syscall paths where we can return EINTR trivially and allow the current task to die (note that EINTR will never get to the userspace as the task has fatal signal pending). Others seem to be easy as well as the callers are already handling fatal errors and bail and return to userspace which should be sufficient to handle the failure gracefully. I am not familiar with all those code paths so a deeper review is really appreciated. As this work is touching more areas which are not directly connected I have tried to keep the CC list as small as possible and people who I believed would be familiar are CCed only to the specific patches (all should have received the cover though). This patchset is based on linux-next and it depends on down_write_killable for rw_semaphores which got merged into tip locking/rwsem branch and it is merged into this next tree. I guess it would be easiest to route these patches via mmotm because of the dependency on the tip tree but if respective maintainers prefer other way I have no objections. I haven't covered all the mmap_write(mm->mmap_sem) instances here $ git grep "down_write(.\<mmap_sem\>)" next/master \| wc -l 98 $ git grep "down_write(.\<mmap_sem\>)" \| wc -l 62 I have tried to cover those which should be relatively easy to review in this series because this alone should be a nice improvement. Other places can be changed on top. [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org This patch (of 18): This is the first step in making mmap_sem write waiters killable. It focuses on the trivial ones which are taking the lock early after entering the syscall and they are not changing state before. Therefore it is very easy to change them to use down_write_killable and immediately return with -EINTR. This will allow the waiter to pass away without blocking the mmap_sem which might be required to make a forward progress. E.g. the oom reaper will need the lock for reading to dismantle the OOM victim address space. The only tricky function in this patch is vm_mmap_pgoff which has many call sites via vm_mmap. To reduce the risk keep vm_mmap with the original non-killable semantic for now. vm_munmap callers do not bother checking the return value so open code it into the munmap syscall path for now for simplicity. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-05-23 17:04:14 -07:00
Kirill A. Shutemov	7162a1e87b	mm: fix mlock accouting Tetsuo Handa reported underflow of NR_MLOCK on munlock. Testcase: #include <stdio.h> #include <stdlib.h> #include <sys/mman.h> #define BASE ((void )0x400000000000) #define SIZE (1UL << 21) int main(int argc, char argv[]) { void *addr; system("grep Mlocked /proc/meminfo"); addr = mmap(BASE, SIZE, PROT_READ \| PROT_WRITE, MAP_ANONYMOUS \| MAP_PRIVATE \| MAP_LOCKED \| MAP_FIXED, -1, 0); if (addr == MAP_FAILED) printf("mmap() failed\n"), exit(1); munmap(addr, SIZE); system("grep Mlocked /proc/meminfo"); return 0; } It happens on munlock_vma_page() due to unfortunate choice of nr_pages data type: __mod_zone_page_state(zone, NR_MLOCK, -nr_pages); For unsigned int nr_pages, implicitly casted to long in __mod_zone_page_state(), it becomes something around UINT_MAX. munlock_vma_page() usually called for THP as small pages go though pagevec. Let's make nr_pages signed int. Similar fixes in `6cdb18ad98` ("mm/vmstat: fix overflow in mod_zone_page_state()") used `long' type, but `int' here is OK for a count of the number of sub-pages in a huge page. Fixes: `ff6a6da60b` ("mm: accelerate munlock() treatment of THP pages") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Michel Lespinasse <walken@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [4.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-21 17:20:51 -08:00
Wang Xiaoqiang	7f43add451	mm/mlock.c: change can_do_mlock return value type to boolean Since can_do_mlock only return 1 or 0, so make it boolean. No functional change. [akpm@linux-foundation.org: update declaration in mm.h] Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00
Kirill A. Shutemov	e90309c9f7	thp: allow mlocked THP again Before THP refcounting rework, THP was not allowed to cross VMA boundary. So, if we have THP and we split it, PG_mlocked can be safely transferred to small pages. With new THP refcounting and naive approach to mlocking we can end up with this scenario: 1. we have a mlocked THP, which belong to one VM_LOCKED VMA. 2. the process does munlock() on the part of the THP: - the VMA is split into two, one of them VM_LOCKED; - huge PMD split into PTE table; - THP is still mlocked; 3. split_huge_page(): - it transfers PG_mlocked to all small pages regrardless if it blong to any VM_LOCKED VMA. We probably could munlock() all small pages on split_huge_page(), but I think we have accounting issue already on step two. Instead of forbidding mlocked pages altogether, we just avoid mlocking PTE-mapped THPs and munlock THPs on split_huge_pmd(). This means PTE-mapped THPs will be on normal lru lists and will be split under memory pressure by vmscan. After the split vmscan will detect unevictable small pages and mlock them. With this approach we shouldn't hit situation like described above. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00
Kirill A. Shutemov	7479df6da9	thp, mlock: do not allow huge pages in mlocked area With new refcounting THP can belong to several VMAs. This makes tricky to track THP pages, when they partially mlocked. It can lead to leaking mlocked pages to non-VM_LOCKED vmas and other problems. With this patch we will split all pages on mlock and avoid fault-in/collapse new THP in VM_LOCKED vmas. I've tried alternative approach: do not mark THP pages mlocked and keep them on normal LRUs. This way vmscan could try to split huge pages on memory pressure and free up subpages which doesn't belong to VM_LOCKED vmas. But this is user-visible change: we screw up Mlocked accouting reported in meminfo, so I had to leave this approach aside. We can bring something better later, but this should be good enough for now. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00
Alexey Klimov	ab7a5af7fd	mm/mlock.c: drop unneeded initialization in munlock_vma_pages_range() Before usage page pointer initialized by NULL is reinitialized by follow_page_mask(). Drop useless init of page pointer in the beginning of loop. Signed-off-by: Alexey Klimov <klimov.linux@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-14 16:00:49 -08:00
Eric B Munson	b0f205c2a3	mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage The previous patch introduced a flag that specified pages in a VMA should be placed on the unevictable LRU, but they should not be made present when the area is created. This patch adds the ability to set this state via the new mlock system calls. We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall. MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED. MCL_ONFAULT should be used as a modifier to the two other mlockall flags. When used with MCL_CURRENT, all current mappings will be marked with VM_LOCKED \| VM_LOCKONFAULT. When used with MCL_FUTURE, the mm->def_flags will be marked with VM_LOCKED \| VM_LOCKONFAULT. When used with both MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be marked with VM_LOCKED \| VM_LOCKONFAULT. Prior to this patch, mlockall() will unconditionally clear the mm->def_flags any time it is called without MCL_FUTURE. This behavior is maintained after adding MCL_ONFAULT. If a call to mlockall(MCL_FUTURE) is followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and new VMAs will be unlocked. This remains true with or without MCL_ONFAULT in either mlockall() invocation. munlock() will unconditionally clear both vma flags. munlockall() unconditionally clears for VMA flags on all VMAs and in the mm->def_flags field. Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Eric B Munson	de60f5f10c	mm: introduce VM_LOCKONFAULT The cost of faulting in all memory to be locked can be very high when working with large mappings. If only portions of the mapping will be used this can incur a high penalty for locking. For the example of a large file, this is the usage pattern for a large statical language model (probably applies to other statical or graphical models as well). For the security example, any application transacting in data that cannot be swapped out (credit card data, medical records, etc). This patch introduces the ability to request that pages are not pre-faulted, but are placed on the unevictable LRU when they are finally faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to be added to the unevictable LRU when they are faulted or if they are already present, but will not cause any missing pages to be faulted in. Exposing this new lock state means that we cannot overload the meaning of the FOLL_POPULATE flag any longer. Prior to this patch it was used to mean that the VMA for a fault was locked. This means we need the new FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE will now only control if the VMA should be populated and in the case of VM_LOCKONFAULT, it will not be set. Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Eric B Munson	a8ca5d0ecb	mm: mlock: add new mlock system call With the refactored mlock code, introduce a new system call for mlock. The new call will allow the user to specify what lock states are being added. mlock2 is trivial at the moment, but a follow on patch will add a new mlock state making it useful. Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Eric B Munson	1aab92ec3d	mm: mlock: refactor mlock, munlock, and munlockall code mlock() allows a user to control page out of program memory, but this comes at the cost of faulting in the entire mapping when it is allocated. For large mappings where the entire area is not necessary this is not ideal. Instead of forcing all locked pages to be present when they are allocated, this set creates a middle ground. Pages are marked to be placed on the unevictable LRU (locked) when they are first used, but they are not faulted in by the mlock call. This series introduces a new mlock() system call that takes a flags argument along with the start address and size. This flags argument gives the caller the ability to request memory be locked in the traditional way, or to be locked after the page is faulted in. A new MCL flag is added to mirror the lock on fault behavior from mlock() in mlockall(). There are two main use cases that this set covers. The first is the security focussed mlock case. A buffer is needed that cannot be written to swap. The maximum size is known, but on average the memory used is significantly less than this maximum. With lock on fault, the buffer is guaranteed to never be paged out without consuming the maximum size every time such a buffer is created. The second use case is focussed on performance. Portions of a large file are needed and we want to keep the used portions in memory once accessed. This is the case for large graphical models where the path through the graph is not known until run time. The entire graph is unlikely to be used in a given invocation, but once a node has been used it needs to stay resident for further processing. Given these constraints we have a number of options. We can potentially waste a large amount of memory by mlocking the entire region (this can also cause a significant stall at startup as the entire file is read in). We can mlock every page as we access them without tracking if the page is already resident but this introduces large overhead for each access. The third option is mapping the entire region with PROT_NONE and using a signal handler for SIGSEGV to mprotect(PROT_READ) and mlock() the needed page. Doing this page at a time adds a significant performance penalty. Batching can be used to mitigate this overhead, but in order to safely avoid trying to mprotect pages outside of the mapping, the boundaries of each mapping to be used in this way must be tracked and available to the signal handler. This is precisely what the mm system in the kernel should already be doing. For mlock(MLOCK_ONFAULT) the user is charged against RLIMIT_MEMLOCK as if mlock(MLOCK_LOCKED) or mmap(MAP_LOCKED) was used, so when the VMA is created not when the pages are faulted in. For mlockall(MCL_ONFAULT) the user is charged as if MCL_FUTURE was used. This decision was made to keep the accounting checks out of the page fault path. To illustrate the benefit of this set I wrote a test program that mmaps a 5 GB file filled with random data and then makes 15,000,000 accesses to random addresses in that mapping. The test program was run 20 times for each setup. Results are reported for two program portions, setup and execution. The setup phase is calling mmap and optionally mlock on the entire region. For most experiments this is trivial, but it highlights the cost of faulting in the entire region. Results are averages across the 20 runs in milliseconds. mmap with mlock(MLOCK_LOCKED) on entire range: Setup avg: 8228.666 Processing avg: 8274.257 mmap with mlock(MLOCK_LOCKED) before each access: Setup avg: 0.113 Processing avg: 90993.552 mmap with PROT_NONE and signal handler and batch size of 1 page: With the default value in max_map_count, this gets ENOMEM as I attempt to change the permissions, after upping the sysctl significantly I get: Setup avg: 0.058 Processing avg: 69488.073 mmap with PROT_NONE and signal handler and batch size of 8 pages: Setup avg: 0.068 Processing avg: 38204.116 mmap with PROT_NONE and signal handler and batch size of 16 pages: Setup avg: 0.044 Processing avg: 29671.180 mmap with mlock(MLOCK_ONFAULT) on entire range: Setup avg: 0.189 Processing avg: 17904.899 The signal handler in the batch cases faulted in memory in two steps to avoid having to know the start and end of the faulting mapping. The first step covers the page that caused the fault as we know that it will be possible to lock. The second step speculatively tries to mlock and mprotect the batch size - 1 pages that follow. There may be a clever way to avoid this without having the program track each mapping to be covered by this handeler in a globally accessible structure, but I could not find it. It should be noted that with a large enough batch size this two step fault handler can still cause the program to crash if it reaches far beyond the end of the mapping. These results show that if the developer knows that a majority of the mapping will be used, it is better to try and fault it in at once, otherwise mlock(MLOCK_ONFAULT) is significantly faster. The performance cost of these patches are minimal on the two benchmarks I have tested (stream and kernbench). The following are the average values across 20 runs of stream and 10 runs of kernbench after a warmup run whose results were discarded. Avg throughput in MB/s from stream using 1000000 element arrays Test 4.2-rc1 4.2-rc1+lock-on-fault Copy: 10,566.5 10,421 Scale: 10,685 10,503.5 Add: 12,044.1 11,814.2 Triad: 12,064.8 11,846.3 Kernbench optimal load 4.2-rc1 4.2-rc1+lock-on-fault Elapsed Time 78.453 78.991 User Time 64.2395 65.2355 System Time 9.7335 9.7085 Context Switches 22211.5 22412.1 Sleeps 14965.3 14956.1 This patch (of 6): Extending the mlock system call is very difficult because it currently does not take a flags argument. A later patch in this set will extend mlock to support a middle ground between pages that are locked and faulted in immediately and unlocked pages. To pave the way for the new system call, the code needs some reorganization so that all the actual entry point handles is checking input and translating to VMA flags. Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Alexander Kuleshov	8fd9e4883a	mm/mlock: use offset_in_page macro linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Alexey Klimov	86d2adccfb	mm/mlock.c: reorganize mlockall() return values and remove goto-out label In mlockall syscall wrapper after out-label for goto code just doing return. Remove goto out statements and return error values directly. Also instead of rewriting ret variable before every if-check move returns to 'error'-like path under if-check. Objdump asm listing showed me reducing by few asm lines. Object file size descreased from 220592 bytes to 220528 bytes for me (for aarch64). Signed-off-by: Alexey Klimov <klimov.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-05 19:34:48 -08:00
Andrea Arcangeli	19a809afe2	userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge must be aware about so that we can merge vmas back like they were originally before arming the userfaultfd on some memory range. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-09-04 16:54:41 -07:00
Kirill A. Shutemov	acc3c8d15e	mm: move mm_populate()-related code to mm/gup.c It's odd that we have populate_vma_page_range() and __mm_populate() in mm/mlock.c. It's implementation of generic memory population and mlocking is one of possible side effect, if VM_LOCKED is set. __get_user_pages() is core of the implementation. Let's move the code into mm/gup.c. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-04-14 16:49:00 -07:00
Kirill A. Shutemov	c561259ca7	mm: move gup() -> posix mlock() error conversion out of __mm_populate This is praparation to moving mm_populate()-related code out of mm/mlock.c. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-04-14 16:49:00 -07:00
Kirill A. Shutemov	fc05f56621	mm: rename __mlock_vma_pages_range() to populate_vma_page_range() __mlock_vma_pages_range() doesn't necessarily mlock pages. It depends on vma flags. The same codepath is used for MAP_POPULATE. Let's rename __mlock_vma_pages_range() to populate_vma_page_range(). This patch also drops mlock_vma_pages_range() references from documentation. It has gone in `cea10a19b7` ("mm: directly use __mlock_vma_pages_range() in find_extend_vma()"). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-04-14 16:49:00 -07:00
Kirill A. Shutemov	84d33df279	mm: rename FOLL_MLOCK to FOLL_POPULATE After commit `a1fde08c74` ("VM: skip the stack guard page lookup in get_user_pages only for mlock") FOLL_MLOCK has lost its original meaning: we don't necessarily mlock the page if the flags is set -- we also take VM_LOCKED into consideration. Since we use the same codepath for __mm_populate(), let's rename FOLL_MLOCK to FOLL_POPULATE. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-04-14 16:48:59 -07:00
Jeff Vander Stoep	a5a6579db3	mm: reorder can_do_mlock to fix audit denial A userspace call to mmap(MAP_LOCKED) may result in the successful locking of memory while also producing a confusing audit log denial. can_do_mlock checks capable and rlimit. If either of these return positive can_do_mlock returns true. The capable check leads to an LSM hook used by apparmour and selinux which produce the audit denial. Reordering so rlimit is checked first eliminates the denial on success, only recording a denial when the lock is unsuccessful as a result of the denial. Signed-off-by: Jeff Vander Stoep <jeffv@google.com> Acked-by: Nick Kralevich <nnk@google.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Paul Cassella <cassella@cray.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-03-12 18:46:08 -07:00
Linus Torvalds	d6dd50e07c	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main changes in this cycle were: - changes related to No-CBs CPUs and NO_HZ_FULL - RCU-tasks implementation - torture-test updates - miscellaneous fixes - locktorture updates - RCU documentation updates" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits) workqueue: Use cond_resched_rcu_qs macro workqueue: Add quiescent state between work items locktorture: Cleanup header usage locktorture: Cannot hold read and write lock locktorture: Fix __acquire annotation for spinlock irq locktorture: Support rwlocks rcu: Eliminate deadlock between CPU hotplug and expedited grace periods locktorture: Document boot/module parameters rcutorture: Rename rcutorture_runnable parameter locktorture: Add test scenario for rwsem_lock locktorture: Add test scenario for mutex_lock locktorture: Make torture scripting account for new _runnable name locktorture: Introduce torture context locktorture: Support rwsems locktorture: Add infrastructure for torturing read locks torture: Address race in module cleanup locktorture: Make statistics generic locktorture: Teach about lock debugging locktorture: Support mutexes locktorture: Add documentation ...	2014-10-13 15:44:12 +02:00
Sasha Levin	96dad67ff2	mm: use VM_BUG_ON_MM where possible Dump the contents of the relevant struct_mm when we hit the bug condition. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-10-09 22:25:58 -04:00
Sasha Levin	81d1b09c6b	mm: convert a few VM_BUG_ON callers to VM_BUG_ON_VMA Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract more information when they trigger. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michel Lespinasse <walken@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-10-09 22:25:57 -04:00
Paul E. McKenney	bde6c3aa99	rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops RCU-tasks requires the occasional voluntary context switch from CPU-bound in-kernel tasks. In some cases, this requires instrumenting cond_resched(). However, there is some reluctance to countenance unconditionally instrumenting cond_resched() (see http://lwn.net/Articles/603252/), so this commit creates a separate cond_resched_rcu_qs() that may be used in place of cond_resched() in locations prone to long-duration in-kernel looping. This commit currently instruments only RCU-tasks. Future possibilities include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce IPI usage. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2014-09-07 16:27:20 -07:00
Paul Cassella	9a95f3cf7b	mm: describe mmap_sem rules for __lock_page_or_retry() and callers Add a comment describing the circumstances in which __lock_page_or_retry() will or will not release the mmap_sem when returning 0. Add comments to lock_page_or_retry()'s callers (filemap_fault(), do_swap_page()) noting the impact on VM_FAULT_RETRY returns. Add comments on up the call tree, particularly replacing the false "We return with mmap_sem still held" comments. Signed-off-by: Paul Cassella <cassella@cray.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-08-06 18:01:20 -07:00

1 2 3

126 Commits