Commit Graph

1429 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
291d853dff Merge 4.19.88 into android-4.19
Changes in 4.19.88
	clk: meson: gxbb: let sar_adc_clk_div set the parent clock rate
	clocksource/drivers/mediatek: Fix error handling
	ASoC: msm8916-wcd-analog: Fix RX1 selection in RDAC2 MUX
	ASoC: compress: fix unsigned integer overflow check
	reset: Fix memory leak in reset_control_array_put()
	clk: samsung: exynos5433: Fix error paths
	ASoC: kirkwood: fix external clock probe defer
	ASoC: kirkwood: fix device remove ordering
	clk: samsung: exynos5420: Preserve PLL configuration during suspend/resume
	pinctrl: cherryview: Allocate IRQ chip dynamic
	ARM: dts: imx6qdl-sabreauto: Fix storm of accelerometer interrupts
	reset: fix reset_control_ops kerneldoc comment
	clk: at91: avoid sleeping early
	clk: sunxi: Fix operator precedence in sunxi_divs_clk_setup
	clk: sunxi-ng: a80: fix the zero'ing of bits 16 and 18
	ARM: dts: sun8i-a83t-tbs-a711: Fix WiFi resume from suspend
	samples/bpf: fix build by setting HAVE_ATTR_TEST to zero
	powerpc/bpf: Fix tail call implementation
	idr: Fix integer overflow in idr_for_each_entry
	idr: Fix idr_alloc_u32 on 32-bit systems
	x86/resctrl: Prevent NULL pointer dereference when reading mondata
	clk: ti: dra7-atl-clock: Remove ti_clk_add_alias call
	clk: ti: clkctrl: Fix failed to enable error with double udelay timeout
	net: fec: add missed clk_disable_unprepare in remove
	bridge: ebtables: don't crash when using dnat target in output chains
	can: peak_usb: report bus recovery as well
	can: c_can: D_CAN: c_can_chip_config(): perform a sofware reset on open
	can: rx-offload: can_rx_offload_queue_tail(): fix error handling, avoid skb mem leak
	can: rx-offload: can_rx_offload_offload_one(): do not increase the skb_queue beyond skb_queue_len_max
	can: rx-offload: can_rx_offload_offload_one(): increment rx_fifo_errors on queue overflow or OOM
	can: rx-offload: can_rx_offload_offload_one(): use ERR_PTR() to propagate error value in case of errors
	can: rx-offload: can_rx_offload_irq_offload_timestamp(): continue on error
	can: rx-offload: can_rx_offload_irq_offload_fifo(): continue on error
	can: flexcan: increase error counters if skb enqueueing via can_rx_offload_queue_sorted() fails
	can: mcp251x: mcp251x_restart_work_handler(): Fix potential force_quit race condition
	watchdog: meson: Fix the wrong value of left time
	ASoC: stm32: sai: add restriction on mmap support
	scripts/gdb: fix debugging modules compiled with hot/cold partitioning
	net: bcmgenet: use RGMII loopback for MAC reset
	net: bcmgenet: reapply manual settings to the PHY
	net: mscc: ocelot: fix __ocelot_rmw_ix prototype
	ceph: return -EINVAL if given fsc mount option on kernel w/o support
	net/fq_impl: Switch to kvmalloc() for memory allocation
	mac80211: fix station inactive_time shortly after boot
	block: drbd: remove a stray unlock in __drbd_send_protocol()
	pwm: bcm-iproc: Prevent unloading the driver module while in use
	scsi: target/tcmu: Fix queue_cmd_ring() declaration
	scsi: lpfc: Fix kernel Oops due to null pring pointers
	scsi: lpfc: Fix dif and first burst use in write commands
	ARM: dts: Fix up SQ201 flash access
	tracing: Lock event_mutex before synth_event_mutex
	ARM: debug-imx: only define DEBUG_IMX_UART_PORT if needed
	ARM: dts: imx51: Fix memory node duplication
	ARM: dts: imx53: Fix memory node duplication
	ARM: dts: imx31: Fix memory node duplication
	ARM: dts: imx35: Fix memory node duplication
	ARM: dts: imx7: Fix memory node duplication
	ARM: dts: imx6ul: Fix memory node duplication
	ARM: dts: imx6sx: Fix memory node duplication
	ARM: dts: imx6sl: Fix memory node duplication
	ARM: dts: imx50: Fix memory node duplication
	ARM: dts: imx23: Fix memory node duplication
	ARM: dts: imx1: Fix memory node duplication
	ARM: dts: imx27: Fix memory node duplication
	ARM: dts: imx25: Fix memory node duplication
	ARM: dts: imx53-voipac-dmm-668: Fix memory node duplication
	parisc: Fix serio address output
	parisc: Fix HP SDC hpa address output
	ARM: dts: Fix hsi gdd range for omap4
	arm64: mm: Prevent mismatched 52-bit VA support
	arm64: smp: Handle errors reported by the firmware
	bus: ti-sysc: Check for no-reset and no-idle flags at the child level
	platform/x86: mlx-platform: Fix LED configuration
	ARM: OMAP1: fix USB configuration for device-only setups
	RDMA/hns: Fix the bug while use multi-hop of pbl
	arm64: preempt: Fix big-endian when checking preempt count in assembly
	RDMA/vmw_pvrdma: Use atomic memory allocation in create AH
	PM / AVS: SmartReflex: NULL check before some freeing functions is not needed
	xfs: zero length symlinks are not valid
	ARM: ks8695: fix section mismatch warning
	ACPI / LPSS: Ignore acpi_device_fix_up_power() return value
	scsi: lpfc: Enable Management features for IF_TYPE=6
	scsi: qla2xxx: Fix NPIV handling for FC-NVMe
	scsi: qla2xxx: Fix for FC-NVMe discovery for NPIV port
	nvme: provide fallback for discard alloc failure
	s390/zcrypt: make sysfs reset attribute trigger queue reset
	crypto: user - support incremental algorithm dumps
	arm64: dts: renesas: draak: Fix CVBS input
	mwifiex: fix potential NULL dereference and use after free
	mwifiex: debugfs: correct histogram spacing, formatting
	brcmfmac: set F2 watermark to 256 for 4373
	brcmfmac: set SDIO F1 MesBusyCtrl for CYW4373
	rtl818x: fix potential use after free
	bcache: do not check if debug dentry is ERR or NULL explicitly on remove
	bcache: do not mark writeback_running too early
	xfs: require both realtime inodes to mount
	nvme: fix kernel paging oops
	ubifs: Fix default compression selection in ubifs
	ubi: Put MTD device after it is not used
	ubi: Do not drop UBI device reference before using
	microblaze: adjust the help to the real behavior
	microblaze: move "... is ready" messages to arch/microblaze/Makefile
	microblaze: fix multiple bugs in arch/microblaze/boot/Makefile
	iwlwifi: move iwl_nvm_check_version() into dvm
	iwlwifi: mvm: force TCM re-evaluation on TCM resume
	iwlwifi: pcie: fix erroneous print
	iwlwifi: pcie: set cmd_len in the correct place
	gpio: pca953x: Fix AI overflow on PCAL6524
	gpiolib: Fix return value of gpio_to_desc() stub if !GPIOLIB
	kvm: vmx: Set IA32_TSC_AUX for legacy mode guests
	Revert "KVM: nVMX: reset cache/shadows when switching loaded VMCS"
	Revert "KVM: nVMX: move check_vmentry_postreqs() call to nested_vmx_enter_non_root_mode()"
	crypto/chelsio/chtls: listen fails with multiadapt
	VSOCK: bind to random port for VMADDR_PORT_ANY
	mmc: meson-gx: make sure the descriptor is stopped on errors
	mtd: rawnand: sunxi: Write pageprog related opcodes to WCMD_SET
	usb: ehci-omap: Fix deferred probe for phy handling
	btrfs: Check for missing device before bio submission in btrfs_map_bio
	btrfs: fix ncopies raid_attr for RAID56
	btrfs: dev-replace: set result code of cancel by status of scrub
	Btrfs: allow clear_extent_dirty() to receive a cached extent state record
	btrfs: only track ref_heads in delayed_ref_updates
	serial: sh-sci: Fix crash in rx_timer_fn() on PIO fallback
	HID: intel-ish-hid: fixes incorrect error handling
	gpio: raspberrypi-exp: decrease refcount on firmware dt node
	serial: 8250: Rate limit serial port rx interrupts during input overruns
	kprobes/x86/xen: blacklist non-attachable xen interrupt functions
	xen/pciback: Check dev_data before using it
	kprobes: Blacklist symbols in arch-defined prohibited area
	kprobes/x86: Show x86-64 specific blacklisted symbols correctly
	vfio-mdev/samples: Use u8 instead of char for handle functions
	memory: omap-gpmc: Get the header of the enum
	pinctrl: xway: fix gpio-hog related boot issues
	net/mlx5: Continue driver initialization despite debugfs failure
	netfilter: nf_nat_sip: fix RTP/RTCP source port translations
	exofs_mount(): fix leaks on failure exits
	bnxt_en: Return linux standard errors in bnxt_ethtool.c
	bnxt_en: Save ring statistics before reset.
	bnxt_en: query force speeds before disabling autoneg mode.
	KVM: s390: unregister debug feature on failing arch init
	pinctrl: sh-pfc: r8a77990: Fix MOD_SEL0 SEL_I2C1 field width
	pinctrl: sh-pfc: sh7264: Fix PFCR3 and PFCR0 register configuration
	pinctrl: sh-pfc: sh7734: Fix shifted values in IPSR10
	HID: doc: fix wrong data structure reference for UHID_OUTPUT
	dm flakey: Properly corrupt multi-page bios.
	gfs2: take jdata unstuff into account in do_grow
	dm raid: fix false -EBUSY when handling check/repair message
	xfs: Align compat attrlist_by_handle with native implementation.
	xfs: Fix bulkstat compat ioctls on x32 userspace.
	IB/qib: Fix an error code in qib_sdma_verbs_send()
	clocksource/drivers/fttmr010: Fix invalid interrupt register access
	vxlan: Fix error path in __vxlan_dev_create()
	powerpc/book3s/32: fix number of bats in p/v_block_mapped()
	powerpc/xmon: fix dump_segments()
	drivers/regulator: fix a missing check of return value
	Bluetooth: hci_bcm: Handle specific unknown packets after firmware loading
	serial: max310x: Fix tx_empty() callback
	openrisc: Fix broken paths to arch/or32
	RDMA/srp: Propagate ib_post_send() failures to the SCSI mid-layer
	scsi: qla2xxx: deadlock by configfs_depend_item
	scsi: csiostor: fix incorrect dma device in case of vport
	brcmfmac: Fix access point mode
	ath6kl: Only use match sets when firmware supports it
	ath6kl: Fix off by one error in scan completion
	powerpc/perf: Fix unit_sel/cache_sel checks
	powerpc/32: Avoid unsupported flags with clang
	powerpc/prom: fix early DEBUG messages
	powerpc/mm: Make NULL pointer deferences explicit on bad page faults.
	powerpc/44x/bamboo: Fix PCI range
	vfio/spapr_tce: Get rid of possible infinite loop
	powerpc/powernv/eeh/npu: Fix uninitialized variables in opal_pci_eeh_freeze_status
	drbd: ignore "all zero" peer volume sizes in handshake
	drbd: reject attach of unsuitable uuids even if connected
	drbd: do not block when adjusting "disk-options" while IO is frozen
	drbd: fix print_st_err()'s prototype to match the definition
	IB/rxe: Make counters thread safe
	bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't
	regulator: tps65910: fix a missing check of return value
	powerpc/83xx: handle machine check caused by watchdog timer
	powerpc/pseries: Fix node leak in update_lmb_associativity_index()
	powerpc: Fix HMIs on big-endian with CONFIG_RELOCATABLE=y
	crypto: mxc-scc - fix build warnings on ARM64
	pwm: clps711x: Fix period calculation
	net/netlink_compat: Fix a missing check of nla_parse_nested
	net/net_namespace: Check the return value of register_pernet_subsys()
	f2fs: fix block address for __check_sit_bitmap
	f2fs: fix to dirty inode synchronously
	um: Include sys/uio.h to have writev()
	um: Make GCOV depend on !KCOV
	net: (cpts) fix a missing check of clk_prepare
	net: stmicro: fix a missing check of clk_prepare
	net: dsa: bcm_sf2: Propagate error value from mdio_write
	atl1e: checking the status of atl1e_write_phy_reg
	tipc: fix a missing check of genlmsg_put
	net: marvell: fix a missing check of acpi_match_device
	net/wan/fsl_ucc_hdlc: Avoid double free in ucc_hdlc_probe()
	ocfs2: clear journal dirty flag after shutdown journal
	vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n
	mm/page_alloc.c: free order-0 pages through PCP in page_frag_free()
	mm/page_alloc.c: use a single function to free page
	mm/page_alloc.c: deduplicate __memblock_free_early() and memblock_free()
	tools/vm/page-types.c: fix "kpagecount returned fewer pages than expected" failures
	netfilter: nf_tables: fix a missing check of nla_put_failure
	xprtrdma: Prevent leak of rpcrdma_rep objects
	infiniband: bnxt_re: qplib: Check the return value of send_message
	infiniband/qedr: Potential null ptr dereference of qp
	firmware: arm_sdei: fix wrong of_node_put() in init function
	firmware: arm_sdei: Fix DT platform device creation
	lib/genalloc.c: fix allocation of aligned buffer from non-aligned chunk
	lib/genalloc.c: use vzalloc_node() to allocate the bitmap
	fork: fix some -Wmissing-prototypes warnings
	drivers/base/platform.c: kmemleak ignore a known leak
	lib/genalloc.c: include vmalloc.h
	mtd: Check add_mtd_device() ret code
	tipc: fix memory leak in tipc_nl_compat_publ_dump
	net/core/neighbour: tell kmemleak about hash tables
	ata: ahci: mvebu: do Armada 38x configuration only on relevant SoCs
	PCI/MSI: Return -ENOSPC from pci_alloc_irq_vectors_affinity()
	net/core/neighbour: fix kmemleak minimal reference count for hash tables
	serial: 8250: Fix serial8250 initialization crash
	gpu: ipu-v3: pre: don't trigger update if buffer address doesn't change
	sfc: suppress duplicate nvmem partition types in efx_ef10_mtd_probe
	ip_tunnel: Make none-tunnel-dst tunnel port work with lwtunnel
	decnet: fix DN_IFREQ_SIZE
	net/smc: prevent races between smc_lgr_terminate() and smc_conn_free()
	net/smc: don't wait for send buffer space when data was already sent
	mm/hotplug: invalid PFNs from pfn_to_online_page()
	xfs: end sync buffer I/O properly on shutdown error
	net/smc: fix sender_free computation
	blktrace: Show requests without sector
	net/smc: fix byte_order for rx_curs_confirmed
	tipc: fix skb may be leaky in tipc_link_input
	ASoC: samsung: i2s: Fix prescaler setting for the secondary DAI
	sfc: initialise found bitmap in efx_ef10_mtd_probe
	geneve: change NET_UDP_TUNNEL dependency to select
	net: fix possible overflow in __sk_mem_raise_allocated()
	net: ip_gre: do not report erspan_ver for gre or gretap
	net: ip6_gre: do not report erspan_ver for ip6gre or ip6gretap
	sctp: don't compare hb_timer expire date before starting it
	bpf: decrease usercnt if bpf_map_new_fd() fails in bpf_map_get_fd_by_id()
	mmc: core: align max segment size with logical block size
	net: dev: Use unsigned integer as an argument to left-shift
	kvm: properly check debugfs dentry before using it
	bpf: drop refcount if bpf_map_new_fd() fails in map_create()
	net: hns3: Change fw error code NOT_EXEC to NOT_SUPPORTED
	net: hns3: fix PFC not setting problem for DCB module
	net: hns3: fix an issue for hclgevf_ae_get_hdev
	net: hns3: fix an issue for hns3_update_new_int_gl
	iommu/amd: Fix NULL dereference bug in match_hid_uid
	apparmor: delete the dentry in aafs_remove() to avoid a leak
	scsi: libsas: Support SATA PHY connection rate unmatch fixing during discovery
	ACPI / APEI: Don't wait to serialise with oops messages when panic()ing
	ACPI / APEI: Switch estatus pool to use vmalloc memory
	scsi: hisi_sas: shutdown axi bus to avoid exception CQ returned
	scsi: libsas: Check SMP PHY control function result
	RDMA/hns: Fix the bug with updating rq head pointer when flush cqe
	RDMA/hns: Bugfix for the scene without receiver queue
	RDMA/hns: Fix the state of rereg mr
	RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp
	ASoC: rt5645: Headphone Jack sense inverts on the LattePanda board
	powerpc/pseries/dlpar: Fix a missing check in dlpar_parse_cc_property()
	xdp: fix cpumap redirect SKB creation bug
	mtd: Remove a debug trace in mtdpart.c
	mm, gup: add missing refcount overflow checks on s390
	clk: at91: fix update bit maps on CFG_MOR write
	clk: at91: generated: set audio_pll_allowed in at91_clk_register_generated()
	usb: dwc2: use a longer core rest timeout in dwc2_core_reset()
	staging: rtl8192e: fix potential use after free
	staging: rtl8723bs: Drop ACPI device ids
	staging: rtl8723bs: Add 024c:0525 to the list of SDIO device-ids
	USB: serial: ftdi_sio: add device IDs for U-Blox C099-F9P
	mei: bus: prefix device names on bus with the bus name
	mei: me: add comet point V device id
	thunderbolt: Power cycle the router if NVM authentication fails
	xfrm: Fix memleak on xfrm state destroy
	media: v4l2-ctrl: fix flags for DO_WHITE_BALANCE
	net: macb: fix error format in dev_err()
	pwm: Clear chip_data in pwm_put()
	media: atmel: atmel-isc: fix asd memory allocation
	media: atmel: atmel-isc: fix INIT_WORK misplacement
	macvlan: schedule bc_work even if error
	net: psample: fix skb_over_panic
	openvswitch: fix flow command message size
	sctp: Fix memory leak in sctp_sf_do_5_2_4_dupcook
	slip: Fix use-after-free Read in slip_open
	openvswitch: drop unneeded BUG_ON() in ovs_flow_cmd_build_info()
	openvswitch: remove another BUG_ON()
	selftests: bpf: test_sockmap: handle file creation failures gracefully
	tipc: fix link name length check
	sctp: cache netns in sctp_ep_common
	net: sched: fix `tc -s class show` no bstats on class with nolock subqueues
	net: macb: add missed tasklet_kill
	ext4: add more paranoia checking in ext4_expand_extra_isize handling
	watchdog: sama5d4: fix WDD value to be always set to max
	net: macb: Fix SUBNS increment and increase resolution
	net: macb driver, check for SKBTX_HW_TSTAMP
	mtd: rawnand: atmel: Fix spelling mistake in error message
	mtd: rawnand: atmel: fix possible object reference leak
	mtd: spi-nor: cast to u64 to avoid uint overflows
	drm/atmel-hlcdc: revert shift by 8
	mailbox: stm32_ipcc: add spinlock to fix channels concurrent access
	tcp: exit if nothing to retransmit on RTO timeout
	HID: core: check whether Usage Page item is after Usage ID items
	crypto: stm32/hash - Fix hmac issue more than 256 bytes
	media: stm32-dcmi: fix DMA corruption when stopping streaming
	media: stm32-dcmi: fix check of pm_runtime_get_sync return value
	hwrng: stm32 - fix unbalanced pm_runtime_enable
	clk: stm32mp1: fix HSI divider flag
	clk: stm32mp1: fix mcu divider table
	clk: stm32mp1: add CLK_SET_RATE_NO_REPARENT to Kernel clocks
	clk: stm32mp1: parent clocks update
	mailbox: mailbox-test: fix null pointer if no mmio
	pinctrl: stm32: fix memory leak issue
	ASoC: stm32: i2s: fix dma configuration
	ASoC: stm32: i2s: fix 16 bit format support
	ASoC: stm32: i2s: fix IRQ clearing
	ASoC: stm32: sai: add missing put_device()
	dmaengine: stm32-dma: check whether length is aligned on FIFO threshold
	platform/x86: hp-wmi: Fix ACPI errors caused by too small buffer
	platform/x86: hp-wmi: Fix ACPI errors caused by passing 0 as input size
	net: fec: fix clock count mis-match
	Linux 4.19.88

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ifd3801a77cb551be72788031e7fcfc8a1d4fd197
2019-12-05 12:02:49 +01:00
Aaron Lu
9696c7656e mm/page_alloc.c: use a single function to free page
[ Upstream commit 742aa7fb52c56fb3b307e704f93e67b698959cc2 ]

There are multiple places of freeing a page, they all do the same things
so a common function can be used to reduce code duplicate.

It also avoids bug fixed in one function but left in another.

Link: http://lkml.kernel.org/r/20181119134834.17765-3-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pankaj gupta <pagupta@redhat.com>
Cc: Pawel Staszewski <pstaszewski@itcare.pl>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05 09:20:58 +01:00
Aaron Lu
257ad5fbfe mm/page_alloc.c: free order-0 pages through PCP in page_frag_free()
[ Upstream commit 65895b67ad27df0f62bfaf82dd5622f95ea29196 ]

page_frag_free() calls __free_pages_ok() to free the page back to Buddy.
This is OK for high order page, but for order-0 pages, it misses the
optimization opportunity of using Per-Cpu-Pages and can cause zone lock
contention when called frequently.

Pawel Staszewski recently shared his result of 'how Linux kernel handles
normal traffic'[1] and from perf data, Jesper Dangaard Brouer found the
lock contention comes from page allocator:

  mlx5e_poll_tx_cq
  |
   --16.34%--napi_consume_skb
             |
             |--12.65%--__free_pages_ok
             |          |
             |           --11.86%--free_one_page
             |                     |
             |                     |--10.10%--queued_spin_lock_slowpath
             |                     |
             |                      --0.65%--_raw_spin_lock
             |
             |--1.55%--page_frag_free
             |
              --1.44%--skb_release_data

Jesper explained how it happened: mlx5 driver RX-page recycle mechanism is
not effective in this workload and pages have to go through the page
allocator.  The lock contention happens during mlx5 DMA TX completion
cycle.  And the page allocator cannot keep up at these speeds.[2]

I thought that __free_pages_ok() are mostly freeing high order pages and
thought this is an lock contention for high order pages but Jesper
explained in detail that __free_pages_ok() here are actually freeing
order-0 pages because mlx5 is using order-0 pages to satisfy its page pool
allocation request.[3]

The free path as pointed out by Jesper is:
skb_free_head()
  -> skb_free_frag()
    -> page_frag_free()
And the pages being freed on this path are order-0 pages.

Fix this by doing similar things as in __page_frag_cache_drain() - send
the being freed page to PCP if it's an order-0 page, or directly to Buddy
if it is a high order page.

With this change, Paweł hasn't noticed lock contention yet in his
workload and Jesper has noticed a 7% performance improvement using a micro
benchmark and lock contention is gone.  Ilias' test on a 'low' speed 1Gbit
interface on an cortex-a53 shows ~11% performance boost testing with
64byte packets and __free_pages_ok() disappeared from perf top.

[1]: https://www.spinics.net/lists/netdev/msg531362.html
[2]: https://www.spinics.net/lists/netdev/msg531421.html
[3]: https://www.spinics.net/lists/netdev/msg531556.html

[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/20181120014544.GB10657@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Pankaj gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05 09:20:57 +01:00
Sami Tolvanen
9c265248fa FROMLIST: scs: add accounting
This change adds accounting for the memory allocated for shadow stacks.

Bug: 145210207
Change-Id: I51157fe0b23b4cb28bb33c86a5dfe3ac911296a4
(am from https://lore.kernel.org/patchwork/patch/1149055/)
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2019-11-27 12:37:25 -08:00
Greg Kroah-Hartman
b777b6f211 Merge 4.19.84 into android-4.19
Changes in 4.19.84
	bonding: fix state transition issue in link monitoring
	CDC-NCM: handle incomplete transfer of MTU
	ipv4: Fix table id reference in fib_sync_down_addr
	net: ethernet: octeon_mgmt: Account for second possible VLAN header
	net: fix data-race in neigh_event_send()
	net: qualcomm: rmnet: Fix potential UAF when unregistering
	net: usb: qmi_wwan: add support for DW5821e with eSIM support
	NFC: fdp: fix incorrect free object
	nfc: netlink: fix double device reference drop
	NFC: st21nfca: fix double free
	qede: fix NULL pointer deref in __qede_remove()
	net: mscc: ocelot: don't handle netdev events for other netdevs
	net: mscc: ocelot: fix NULL pointer on LAG slave removal
	ipv6: fixes rt6_probe() and fib6_nh->last_probe init
	net: hns: Fix the stray netpoll locks causing deadlock in NAPI path
	ALSA: timer: Fix incorrectly assigned timer instance
	ALSA: bebob: fix to detect configured source of sampling clock for Focusrite Saffire Pro i/o series
	ALSA: hda/ca0132 - Fix possible workqueue stall
	mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges
	mm, meminit: recalculate pcpu batch and high limits after init completes
	mm: thp: handle page cache THP correctly in PageTransCompoundMap
	mm, vmstat: hide /proc/pagetypeinfo from normal users
	dump_stack: avoid the livelock of the dump_lock
	tools: gpio: Use !building_out_of_srctree to determine srctree
	perf tools: Fix time sorting
	drm/radeon: fix si_enable_smc_cac() failed issue
	HID: wacom: generic: Treat serial number and related fields as unsigned
	soundwire: depend on ACPI
	soundwire: bus: set initial value to port_status
	arm64: Do not mask out PTE_RDONLY in pte_same()
	ceph: fix use-after-free in __ceph_remove_cap()
	ceph: add missing check in d_revalidate snapdir handling
	iio: adc: stm32-adc: fix stopping dma
	iio: imu: adis16480: make sure provided frequency is positive
	iio: srf04: fix wrong limitation in distance measuring
	ARM: sunxi: Fix CPU powerdown on A83T
	netfilter: nf_tables: Align nft_expr private data to 64-bit
	netfilter: ipset: Fix an error code in ip_set_sockfn_get()
	intel_th: pci: Add Comet Lake PCH support
	intel_th: pci: Add Jasper Lake PCH support
	x86/apic/32: Avoid bogus LDR warnings
	SMB3: Fix persistent handles reconnect
	can: usb_8dev: fix use-after-free on disconnect
	can: flexcan: disable completely the ECC mechanism
	can: c_can: c_can_poll(): only read status register after status IRQ
	can: peak_usb: fix a potential out-of-sync while decoding packets
	can: rx-offload: can_rx_offload_queue_sorted(): fix error handling, avoid skb mem leak
	can: gs_usb: gs_can_open(): prevent memory leak
	can: dev: add missing of_node_put() after calling of_get_child_by_name()
	can: mcba_usb: fix use-after-free on disconnect
	can: peak_usb: fix slab info leak
	configfs: stash the data we need into configfs_buffer at open time
	configfs_register_group() shouldn't be (and isn't) called in rmdirable parts
	configfs: new object reprsenting tree fragments
	configfs: provide exclusion between IO and removals
	configfs: fix a deadlock in configfs_symlink()
	ALSA: usb-audio: More validations of descriptor units
	ALSA: usb-audio: Simplify parse_audio_unit()
	ALSA: usb-audio: Unify the release of usb_mixer_elem_info objects
	ALSA: usb-audio: Remove superfluous bLength checks
	ALSA: usb-audio: Clean up check_input_term()
	ALSA: usb-audio: Fix possible NULL dereference at create_yamaha_midi_quirk()
	ALSA: usb-audio: remove some dead code
	ALSA: usb-audio: Fix copy&paste error in the validator
	sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
	sched/fair: Fix -Wunused-but-set-variable warnings
	usbip: Fix vhci_urb_enqueue() URB null transfer buffer error path
	usbip: Implement SG support to vhci-hcd and stub driver
	PCI: tegra: Enable Relaxed Ordering only for Tegra20 & Tegra30
	HID: google: add magnemite/masterball USB ids
	dmaengine: xilinx_dma: Fix control reg update in vdma_channel_set_config
	dmaengine: sprd: Fix the possible memory leak issue
	HID: intel-ish-hid: fix wrong error handling in ishtp_cl_alloc_tx_ring()
	RDMA/mlx5: Clear old rate limit when closing QP
	iw_cxgb4: fix ECN check on the passive accept
	RDMA/qedr: Fix reported firmware version
	net/mlx5e: TX, Fix consumer index of error cqe dump
	net/mlx5: prevent memory leak in mlx5_fpga_conn_create_cq
	scsi: qla2xxx: fixup incorrect usage of host_byte
	RDMA/uverbs: Prevent potential underflow
	net: openvswitch: free vport unless register_netdevice() succeeds
	scsi: lpfc: Honor module parameter lpfc_use_adisc
	scsi: qla2xxx: Initialized mailbox to prevent driver load failure
	netfilter: nf_flow_table: set timeout before insertion into hashes
	ipvs: don't ignore errors in case refcounting ip_vs module fails
	ipvs: move old_secure_tcp into struct netns_ipvs
	bonding: fix unexpected IFF_BONDING bit unset
	macsec: fix refcnt leak in module exit routine
	usb: fsl: Check memory resource before releasing it
	usb: gadget: udc: atmel: Fix interrupt storm in FIFO mode.
	usb: gadget: composite: Fix possible double free memory bug
	usb: dwc3: pci: prevent memory leak in dwc3_pci_probe
	usb: gadget: configfs: fix concurrent issue between composite APIs
	usb: dwc3: remove the call trace of USBx_GFLADJ
	perf/x86/amd/ibs: Fix reading of the IBS OpData register and thus precise RIP validity
	perf/x86/amd/ibs: Handle erratum #420 only on the affected CPU family (10h)
	perf/x86/uncore: Fix event group support
	USB: Skip endpoints with 0 maxpacket length
	USB: ldusb: use unsigned size format specifiers
	usbip: tools: Fix read_usb_vudc_device() error path handling
	RDMA/iw_cxgb4: Avoid freeing skb twice in arp failure case
	RDMA/hns: Prevent memory leaks of eq->buf_list
	scsi: qla2xxx: stop timer in shutdown path
	nvme-multipath: fix possible io hang after ctrl reconnect
	fjes: Handle workqueue allocation failure
	net: hisilicon: Fix "Trying to free already-free IRQ"
	net: mscc: ocelot: fix vlan_filtering when enslaving to bridge before link is up
	net: mscc: ocelot: refuse to overwrite the port's native vlan
	iommu/amd: Apply the same IVRS IOAPIC workaround to Acer Aspire A315-41
	drm/amdgpu: If amdgpu_ib_schedule fails return back the error.
	drm/amd/display: Passive DP->HDMI dongle detection fix
	hv_netvsc: Fix error handling in netvsc_attach()
	usb: dwc3: gadget: fix race when disabling ep with cancelled xfers
	NFSv4: Don't allow a cached open with a revoked delegation
	net: ethernet: arc: add the missed clk_disable_unprepare
	igb: Fix constant media auto sense switching when no cable is connected
	e1000: fix memory leaks
	pinctrl: intel: Avoid potential glitches if pin is in GPIO mode
	ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()
	pinctrl: cherryview: Fix irq_valid_mask calculation
	blkcg: make blkcg_print_stat() print stats only for online blkgs
	iio: imu: mpu6050: Add support for the ICM 20602 IMU
	iio: imu: inv_mpu6050: fix no data on MPU6050
	mm/filemap.c: don't initiate writeback if mapping has no dirty pages
	cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
	usbip: Fix free of unallocated memory in vhci tx
	netfilter: ipset: Copy the right MAC address in hash:ip,mac IPv6 sets
	net: prevent load/store tearing on sk->sk_stamp
	iio: imu: mpu6050: Fix FIFO layout for ICM20602
	vsock/virtio: fix sock refcnt holding during the shutdown
	drm/i915: Rename gen7 cmdparser tables
	drm/i915: Disable Secure Batches for gen6+
	drm/i915: Remove Master tables from cmdparser
	drm/i915: Add support for mandatory cmdparsing
	drm/i915: Support ro ppgtt mapped cmdparser shadow buffers
	drm/i915: Allow parsing of unsized batches
	drm/i915: Add gen9 BCS cmdparsing
	drm/i915/cmdparser: Use explicit goto for error paths
	drm/i915/cmdparser: Add support for backward jumps
	drm/i915/cmdparser: Ignore Length operands during command matching
	drm/i915: Lower RM timeout to avoid DSI hard hangs
	drm/i915/gen8+: Add RC6 CTX corruption WA
	drm/i915/cmdparser: Fix jump whitelist clearing
	KVM: x86: use Intel speculation bugs and features as derived in generic x86 code
	x86/msr: Add the IA32_TSX_CTRL MSR
	x86/cpu: Add a helper function x86_read_arch_cap_msr()
	x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default
	x86/speculation/taa: Add mitigation for TSX Async Abort
	x86/speculation/taa: Add sysfs reporting for TSX Async Abort
	kvm/x86: Export MDS_NO=0 to guests when TSX is enabled
	x86/tsx: Add "auto" option to the tsx= cmdline parameter
	x86/speculation/taa: Add documentation for TSX Async Abort
	x86/tsx: Add config options to set tsx=on|off|auto
	x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs
	x86/bugs: Add ITLB_MULTIHIT bug infrastructure
	x86/cpu: Add Tremont to the cpu vulnerability whitelist
	cpu/speculation: Uninline and export CPU mitigations helpers
	Documentation: Add ITLB_MULTIHIT documentation
	kvm: x86, powerpc: do not allow clearing largepages debugfs entry
	kvm: Convert kvm_lock to a mutex
	kvm: mmu: Do not release the page inside mmu_set_spte()
	KVM: x86: make FNAME(fetch) and __direct_map more similar
	KVM: x86: remove now unneeded hugepage gfn adjustment
	KVM: x86: change kvm_mmu_page_get_gfn BUG_ON to WARN_ON
	KVM: x86: add tracepoints around __direct_map and FNAME(fetch)
	KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active
	kvm: mmu: ITLB_MULTIHIT mitigation
	kvm: Add helper function for creating VM worker threads
	kvm: x86: mmu: Recovery of shattered NX large pages
	Linux 4.19.84

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7a820f00c4b868ed677bb49613f835b7e67a3a06
2019-11-14 14:15:21 +08:00
Mel Gorman
7dfa51beac mm, meminit: recalculate pcpu batch and high limits after init completes
commit 3e8fc0075e24338b1117cdff6a79477427b8dbed upstream.

Deferred memory initialisation updates zone->managed_pages during the
initialisation phase but before that finishes, the per-cpu page
allocator (pcpu) calculates the number of pages allocated/freed in
batches as well as the maximum number of pages allowed on a per-cpu
list.  As zone->managed_pages is not up to date yet, the pcpu
initialisation calculates inappropriately low batch and high values.

This increases zone lock contention quite severely in some cases with
the degree of severity depending on how many CPUs share a local zone and
the size of the zone.  A private report indicated that kernel build
times were excessive with extremely high system CPU usage.  A perf
profile indicated that a large chunk of time was lost on zone->lock
contention.

This patch recalculates the pcpu batch and high values after deferred
initialisation completes for every populated zone in the system.  It was
tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
workload -- allmodconfig and all available CPUs.

mmtests configuration: config-workload-kernbench-max Configuration was
modified to build on a fresh XFS partition.

kernbench
                                5.4.0-rc3              5.4.0-rc3
                                  vanilla           resetpcpu-v2
Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)

                   5.4.0-rc3    5.4.0-rc3
                     vanilla resetpcpu-v2
Duration User       39766.24     49221.79
Duration System     44298.10     13361.67
Duration Elapsed      519.11       388.87

The patch reduces system CPU usage by 69.86% and total build time by
26.65%.  The variance of system CPU usage is also much reduced.

Before, this was the breakdown of batch and high values over all zones
was:

    256               batch: 1
    256               batch: 63
    512               batch: 7
    256               high:  0
    256               high:  378
    512               high:  42

512 pcpu pagesets had a batch limit of 7 and a high limit of 42.  After
the patch:

    256               batch: 1
    768               batch: 63
    256               high:  0
    768               high:  378

[mgorman@techsingularity.net: fix merge/linkage snafu]
  Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.netLink: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>	[4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12 19:20:35 +01:00
Andrey Konovalov
be21fa3044 UPSTREAM: kasan, mm, arm64: tag non slab memory allocated via pagealloc
(Upstream commit 2813b9c0296259fb11e75c839bab2d958ba4f96c).

Tag-based KASAN doesn't check memory accesses through pointers tagged with
0xff.  When page_address is used to get pointer to memory that corresponds
to some page, the tag of the resulting pointer gets set to 0xff, even
though the allocated memory might have been tagged differently.

For slab pages it's impossible to recover the correct tag to return from
page_address, since the page might contain multiple slab objects tagged
with different values, and we can't know in advance which one of them is
going to get accessed.  For non slab pages however, we can recover the tag
in page_address, since the whole page was marked with the same tag.

This patch adds tagging to non slab memory allocated with pagealloc.  To
set the tag of the pointer returned from page_address, the tag gets stored
to page->flags when the memory gets allocated.

Link: http://lkml.kernel.org/r/d758ddcef46a5abc9970182b9137e2fbee202a2c.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Bug: 128674696
Change-Id: I500bdf42462fee0ee14495a6be51815e7e44460f
2019-09-24 17:44:13 -07:00
Alexander Potapenko
a5587d8fce UPSTREAM: mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
Upstream commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1
and init_on_free=1 boot options").

Patch series "add init_on_alloc/init_on_free boot options", v10.

Provide init_on_alloc and init_on_free boot options.

These are aimed at preventing possible information leaks and making the
control-flow bugs that depend on uninitialized values more deterministic.

Enabling either of the options guarantees that the memory returned by the
page allocator and SL[AU]B is initialized with zeroes.  SLOB allocator
isn't supported at the moment, as its emulation of kmem caches complicates
handling of SLAB_TYPESAFE_BY_RCU caches correctly.

Enabling init_on_free also guarantees that pages and heap objects are
initialized right after they're freed, so it won't be possible to access
stale data by using a dangling pointer.

As suggested by Michal Hocko, right now we don't let the heap users to
disable initialization for certain allocations.  There's not enough
evidence that doing so can speed up real-life cases, and introducing ways
to opt-out may result in things going out of control.

This patch (of 2):

The new options are needed to prevent possible information leaks and make
control-flow bugs that depend on uninitialized values more deterministic.

This is expected to be on-by-default on Android and Chrome OS.  And it
gives the opportunity for anyone else to use it under distros too via the
boot args.  (The init_on_free feature is regularly requested by folks
where memory forensics is included in their threat models.)

init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
objects with zeroes.  Initialization is done at allocation time at the
places where checks for __GFP_ZERO are performed.

init_on_free=1 makes the kernel initialize freed pages and heap objects
with zeroes upon their deletion.  This helps to ensure sensitive data
doesn't leak via use-after-free accesses.

Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
returns zeroed memory.  The two exceptions are slab caches with
constructors and SLAB_TYPESAFE_BY_RCU flag.  Those are never
zero-initialized to preserve their semantics.

Both init_on_alloc and init_on_free default to zero, but those defaults
can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
CONFIG_INIT_ON_FREE_DEFAULT_ON.

If either SLUB poisoning or page poisoning is enabled, those options take
precedence over init_on_alloc and init_on_free: initialization is only
applied to unpoisoned allocations.

Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:

hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
is within the standard error.

The new features are also going to pave the way for hardware memory
tagging (e.g.  arm64's MTE), which will require both on_alloc and on_free
hooks to set the tags for heap objects.  With MTE, tagging will have the
same cost as memory initialization.

Although init_on_free is rather costly, there are paranoid use-cases where
in-memory data lifetime is desired to be minimized.  There are various
arguments for/against the realism of the associated threat models, but
given that we'll need the infrastructure for MTE anyway, and there are
people who want wipe-on-free behavior no matter what the performance cost,
it seems reasonable to include it in this series.

[glider@google.com: v8]
  Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
[glider@google.com: v9]
  Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
[glider@google.com: v10]
  Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.cz>		[page and dmapool parts
Acked-by: James Morris <jamorris@linux.microsoft.com>]
Cc: Christoph Lameter <cl@linux.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Sandeep Patil <sspatil@android.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Change-Id: If0620a6a8aed34c21e98458c965e94f5a9dfd297
Bug: 138435492
Test: Boot cuttlefish with and without
Test:   CONFIG_INIT_ON_ALLOC_DEFAULT_ON/CONFIG_INIT_ON_FREE_DEFAULT_ON
Signed-off-by: Alexander Potapenko <glider@google.com>
2019-08-30 11:58:12 +02:00
Greg Kroah-Hartman
d1f7f3be99 Merge 4.19.51 into android-4.19
Changes in 4.19.51
	rapidio: fix a NULL pointer dereference when create_workqueue() fails
	fs/fat/file.c: issue flush after the writeback of FAT
	sysctl: return -EINVAL if val violates minmax
	ipc: prevent lockup on alloc_msg and free_msg
	drm/pl111: Initialize clock spinlock early
	ARM: prevent tracing IPI_CPU_BACKTRACE
	mm/hmm: select mmu notifier when selecting HMM
	hugetlbfs: on restore reserve error path retain subpool reservation
	mem-hotplug: fix node spanned pages when we have a node with only ZONE_MOVABLE
	mm/cma.c: fix crash on CMA allocation if bitmap allocation fails
	initramfs: free initrd memory if opening /initrd.image fails
	mm/cma.c: fix the bitmap status to show failed allocation reason
	mm: page_mkclean vs MADV_DONTNEED race
	mm/cma_debug.c: fix the break condition in cma_maxchunk_get()
	mm/slab.c: fix an infinite loop in leaks_show()
	kernel/sys.c: prctl: fix false positive in validate_prctl_map()
	thermal: rcar_gen3_thermal: disable interrupt in .remove
	drivers: thermal: tsens: Don't print error message on -EPROBE_DEFER
	mfd: tps65912-spi: Add missing of table registration
	mfd: intel-lpss: Set the device in reset state when init
	drm/nouveau/disp/dp: respect sink limits when selecting failsafe link configuration
	mfd: twl6040: Fix device init errors for ACCCTL register
	perf/x86/intel: Allow PEBS multi-entry in watermark mode
	drm/nouveau/kms/gf119-gp10x: push HeadSetControlOutputResource() mthd when encoders change
	drm/bridge: adv7511: Fix low refresh rate selection
	objtool: Don't use ignore flag for fake jumps
	drm/nouveau/kms/gv100-: fix spurious window immediate interlocks
	bpf: fix undefined behavior in narrow load handling
	EDAC/mpc85xx: Prevent building as a module
	pwm: meson: Use the spin-lock only to protect register modifications
	mailbox: stm32-ipcc: check invalid irq
	ntp: Allow TAI-UTC offset to be set to zero
	f2fs: fix to avoid panic in do_recover_data()
	f2fs: fix to avoid panic in f2fs_inplace_write_data()
	f2fs: fix to avoid panic in f2fs_remove_inode_page()
	f2fs: fix to do sanity check on free nid
	f2fs: fix to clear dirty inode in error path of f2fs_iget()
	f2fs: fix to avoid panic in dec_valid_block_count()
	f2fs: fix to use inline space only if inline_xattr is enable
	f2fs: fix to do sanity check on valid block count of segment
	f2fs: fix to do checksum even if inode page is uptodate
	percpu: remove spurious lock dependency between percpu and sched
	configfs: fix possible use-after-free in configfs_register_group
	uml: fix a boot splat wrt use of cpu_all_mask
	PCI: dwc: Free MSI in dw_pcie_host_init() error path
	PCI: dwc: Free MSI IRQ page in dw_pcie_free_msi()
	ovl: do not generate duplicate fsnotify events for "fake" path
	mmc: mmci: Prevent polling for busy detection in IRQ context
	netfilter: nf_flow_table: fix missing error check for rhashtable_insert_fast
	netfilter: nf_conntrack_h323: restore boundary check correctness
	mips: Make sure dt memory regions are valid
	netfilter: nf_tables: fix base chain stat rcu_dereference usage
	watchdog: imx2_wdt: Fix set_timeout for big timeout values
	watchdog: fix compile time error of pretimeout governors
	blk-mq: move cancel of requeue_work into blk_mq_release
	iommu/vt-d: Set intel_iommu_gfx_mapped correctly
	misc: pci_endpoint_test: Fix test_reg_bar to be updated in pci_endpoint_test
	PCI: designware-ep: Use aligned ATU window for raising MSI interrupts
	nvme-pci: unquiesce admin queue on shutdown
	nvme-pci: shutdown on timeout during deletion
	netfilter: nf_flow_table: check ttl value in flow offload data path
	netfilter: nf_flow_table: fix netdev refcnt leak
	ALSA: hda - Register irq handler after the chip initialization
	nvmem: core: fix read buffer in place
	nvmem: sunxi_sid: Support SID on A83T and H5
	fuse: retrieve: cap requested size to negotiated max_write
	nfsd: allow fh_want_write to be called twice
	nfsd: avoid uninitialized variable warning
	vfio: Fix WARNING "do not call blocking ops when !TASK_RUNNING"
	iommu/arm-smmu-v3: Don't disable SMMU in kdump kernel
	switchtec: Fix unintended mask of MRPC event
	net: thunderbolt: Unregister ThunderboltIP protocol handler when suspending
	x86/PCI: Fix PCI IRQ routing table memory leak
	i40e: Queues are reserved despite "Invalid argument" error
	platform/chrome: cros_ec_proto: check for NULL transfer function
	PCI: keystone: Prevent ARM32 specific code to be compiled for ARM64
	soc: mediatek: pwrap: Zero initialize rdata in pwrap_init_cipher
	clk: rockchip: Turn on "aclk_dmac1" for suspend on rk3288
	soc: rockchip: Set the proper PWM for rk3288
	ARM: dts: imx51: Specify IMX5_CLK_IPG as "ahb" clock to SDMA
	ARM: dts: imx50: Specify IMX5_CLK_IPG as "ahb" clock to SDMA
	ARM: dts: imx53: Specify IMX5_CLK_IPG as "ahb" clock to SDMA
	ARM: dts: imx6sx: Specify IMX6SX_CLK_IPG as "ahb" clock to SDMA
	ARM: dts: imx6sll: Specify IMX6SLL_CLK_IPG as "ipg" clock to SDMA
	ARM: dts: imx7d: Specify IMX7D_CLK_IPG as "ipg" clock to SDMA
	ARM: dts: imx6ul: Specify IMX6UL_CLK_IPG as "ipg" clock to SDMA
	ARM: dts: imx6sx: Specify IMX6SX_CLK_IPG as "ipg" clock to SDMA
	ARM: dts: imx6qdl: Specify IMX6QDL_CLK_IPG as "ipg" clock to SDMA
	PCI: rpadlpar: Fix leaked device_node references in add/remove paths
	drm/amd/display: Use plane->color_space for dpp if specified
	ARM: OMAP2+: pm33xx-core: Do not Turn OFF CEFUSE as PPA may be using it
	platform/x86: intel_pmc_ipc: adding error handling
	power: supply: max14656: fix potential use-before-alloc
	net: hns3: return 0 and print warning when hit duplicate MAC
	PCI: rcar: Fix a potential NULL pointer dereference
	PCI: rcar: Fix 64bit MSI message address handling
	scsi: qla2xxx: Reset the FCF_ASYNC_{SENT|ACTIVE} flags
	video: hgafb: fix potential NULL pointer dereference
	video: imsttfb: fix potential NULL pointer dereferences
	block, bfq: increase idling for weight-raised queues
	PCI: xilinx: Check for __get_free_pages() failure
	gpio: gpio-omap: add check for off wake capable gpios
	ice: Add missing case in print_link_msg for printing flow control
	dmaengine: idma64: Use actual device for DMA transfers
	pwm: tiehrpwm: Update shadow register for disabling PWMs
	ARM: dts: exynos: Always enable necessary APIO_1V8 and ABB_1V8 regulators on Arndale Octa
	pwm: Fix deadlock warning when removing PWM device
	ARM: exynos: Fix undefined instruction during Exynos5422 resume
	usb: typec: fusb302: Check vconn is off when we start toggling
	soc: renesas: Identify R-Car M3-W ES1.3
	gpio: vf610: Do not share irq_chip
	percpu: do not search past bitmap when allocating an area
	Revert "Bluetooth: Align minimum encryption key size for LE and BR/EDR connections"
	Revert "drm/nouveau: add kconfig option to turn off nouveau legacy contexts. (v3)"
	ovl: check the capability before cred overridden
	ovl: support stacked SEEK_HOLE/SEEK_DATA
	drm/vc4: fix fb references in async update
	ALSA: seq: Cover unsubscribe_port() in list_mutex
	Linux 4.19.51

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-06-15 16:12:59 +02:00
Linxu Fang
5094a85d6d mem-hotplug: fix node spanned pages when we have a node with only ZONE_MOVABLE
[ Upstream commit 299c83dce9ea3a79bb4b5511d2cb996b6b8e5111 ]

342332e6a9 ("mm/page_alloc.c: introduce kernelcore=mirror option") and
later patches rewrote the calculation of node spanned pages.

e506b99696 ("mem-hotplug: fix node spanned pages when we have a movable
node"), but the current code still has problems,

When we have a node with only zone_movable and the node id is not zero,
the size of node spanned pages is double added.

That's because we have an empty normal zone, and zone_start_pfn or
zone_end_pfn is not between arch_zone_lowest_possible_pfn and
arch_zone_highest_possible_pfn, so we need to use clamp to constrain the
range just like the commit <96e907d13602> (bootmem: Reimplement
__absent_pages_in_range() using for_each_mem_pfn_range()).

e.g.
Zone ranges:
  DMA      [mem 0x0000000000001000-0x0000000000ffffff]
  DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
  Normal   [mem 0x0000000100000000-0x000000023fffffff]
Movable zone start for each node
  Node 0: 0x0000000100000000
  Node 1: 0x0000000140000000
Early memory node ranges
  node   0: [mem 0x0000000000001000-0x000000000009efff]
  node   0: [mem 0x0000000000100000-0x00000000bffdffff]
  node   0: [mem 0x0000000100000000-0x000000013fffffff]
  node   1: [mem 0x0000000140000000-0x000000023fffffff]

node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
node 0 Normal	spanned:0	present:0	absent:0
node 0 Movable	spanned:0x40000 present:0x40000 absent:0
On node 0 totalpages(node_present_pages): 1048446
node_spanned_pages:1310719
node 1 DMA	spanned:0	    present:0		absent:0
node 1 DMA32	spanned:0	    present:0		absent:0
node 1 Normal	spanned:0x100000    present:0x100000	absent:0
node 1 Movable	spanned:0x100000    present:0x100000	absent:0
On node 1 totalpages(node_present_pages): 2097152
node_spanned_pages:2097152
Memory: 6967796K/12582392K available (16388K kernel code, 3686K rwdata,
4468K rodata, 2160K init, 10444K bss, 5614596K reserved, 0K
cma-reserved)

It shows that the current memory of node 1 is double added.
After this patch, the problem is fixed.

node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
node 0 Normal	spanned:0	present:0	absent:0
node 0 Movable	spanned:0x40000 present:0x40000 absent:0
On node 0 totalpages(node_present_pages): 1048446
node_spanned_pages:1310719
node 1 DMA	spanned:0	    present:0		absent:0
node 1 DMA32	spanned:0	    present:0		absent:0
node 1 Normal	spanned:0	    present:0		absent:0
node 1 Movable	spanned:0x100000    present:0x100000	absent:0
On node 1 totalpages(node_present_pages): 1048576
node_spanned_pages:1048576
memory: 6967796K/8388088K available (16388K kernel code, 3686K rwdata,
4468K rodata, 2160K init, 10444K bss, 1420292K reserved, 0K
cma-reserved)

Link: http://lkml.kernel.org/r/1554178276-10372-1-git-send-email-fanglinxu@huawei.com
Signed-off-by: Linxu Fang <fanglinxu@huawei.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-15 11:54:00 +02:00
Greg Kroah-Hartman
d885da678e Merge 4.19.34 into android-4.19
Changes in 4.19.34
	arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals
	ext4: cleanup bh release code in ext4_ind_remove_space()
	tty/serial: atmel: Add is_half_duplex helper
	tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped
	CIFS: fix POSIX lock leak and invalid ptr deref
	h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux-
	f2fs: fix to adapt small inline xattr space in __find_inline_xattr()
	f2fs: fix to avoid deadlock in f2fs_read_inline_dir()
	tracing: kdb: Fix ftdump to not sleep
	net/mlx5: Avoid panic when setting vport rate
	net/mlx5: Avoid panic when setting vport mac, getting vport config
	gpio: gpio-omap: fix level interrupt idling
	include/linux/relay.h: fix percpu annotation in struct rchan
	sysctl: handle overflow for file-max
	net: stmmac: Avoid sometimes uninitialized Clang warnings
	enic: fix build warning without CONFIG_CPUMASK_OFFSTACK
	libbpf: force fixdep compilation at the start of the build
	scsi: hisi_sas: Set PHY linkrate when disconnected
	scsi: hisi_sas: Fix a timeout race of driver internal and SMP IO
	iio: adc: fix warning in Qualcomm PM8xxx HK/XOADC driver
	x86/hyperv: Fix kernel panic when kexec on HyperV
	perf c2c: Fix c2c report for empty numa node
	mm/sparse: fix a bad comparison
	mm/cma.c: cma_declare_contiguous: correct err handling
	mm/page_ext.c: fix an imbalance with kmemleak
	mm, swap: bounds check swap_info array accesses to avoid NULL derefs
	mm,oom: don't kill global init via memory.oom.group
	memcg: killed threads should not invoke memcg OOM killer
	mm, mempolicy: fix uninit memory access
	mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
	mm/slab.c: kmemleak no scan alien caches
	ocfs2: fix a panic problem caused by o2cb_ctl
	f2fs: do not use mutex lock in atomic context
	fs/file.c: initialize init_files.resize_wait
	page_poison: play nicely with KASAN
	cifs: use correct format characters
	dm thin: add sanity checks to thin-pool and external snapshot creation
	f2fs: fix to check inline_xattr_size boundary correctly
	cifs: Accept validate negotiate if server return NT_STATUS_NOT_SUPPORTED
	cifs: Fix NULL pointer dereference of devname
	netfilter: nf_tables: check the result of dereferencing base_chain->stats
	netfilter: conntrack: tcp: only close if RST matches exact sequence
	jbd2: fix invalid descriptor block checksum
	fs: fix guard_bio_eod to check for real EOD errors
	tools lib traceevent: Fix buffer overflow in arg_eval
	PCI/PME: Fix hotplug/sysfs remove deadlock in pcie_pme_remove()
	wil6210: check null pointer in _wil_cfg80211_merge_extra_ies
	mt76: fix a leaked reference by adding a missing of_node_put
	crypto: crypto4xx - add missing of_node_put after of_device_is_available
	crypto: cavium/zip - fix collision with generic cra_driver_name
	usb: chipidea: Grab the (legacy) USB PHY by phandle first
	powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables
	scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c
	kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing
	powerpc/xmon: Fix opcode being uninitialized in print_insn_powerpc
	coresight: etm4x: Add support to enable ETMv4.2
	serial: 8250_pxa: honor the port number from devicetree
	ARM: 8840/1: use a raw_spinlock_t in unwind
	iommu/io-pgtable-arm-v7s: Only kmemleak_ignore L2 tables
	powerpc/hugetlb: Handle mmap_min_addr correctly in get_unmapped_area callback
	btrfs: qgroup: Make qgroup async transaction commit more aggressive
	mmc: omap: fix the maximum timeout setting
	net: dsa: mv88e6xxx: Add lockdep classes to fix false positive splat
	e1000e: Fix -Wformat-truncation warnings
	mlxsw: spectrum: Avoid -Wformat-truncation warnings
	platform/x86: ideapad-laptop: Fix no_hw_rfkill_list for Lenovo RESCUER R720-15IKBN
	platform/mellanox: mlxreg-hotplug: Fix KASAN warning
	loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
	IB/mlx4: Increase the timeout for CM cache
	clk: fractional-divider: check parent rate only if flag is set
	perf annotate: Fix getting source line failure
	ASoC: qcom: Fix of-node refcount unbalance in qcom_snd_parse_of()
	cpufreq: acpi-cpufreq: Report if CPU doesn't support boost technologies
	efi: cper: Fix possible out-of-bounds access
	s390/ism: ignore some errors during deregistration
	scsi: megaraid_sas: return error when create DMA pool failed
	scsi: fcoe: make use of fip_mode enum complete
	drm/amd/display: Clear stream->mode_changed after commit
	perf test: Fix failure of 'evsel-tp-sched' test on s390
	mwifiex: don't advertise IBSS features without FW support
	perf report: Don't shadow inlined symbol with different addr range
	SoC: imx-sgtl5000: add missing put_device()
	media: ov7740: fix runtime pm initialization
	media: sh_veu: Correct return type for mem2mem buffer helpers
	media: s5p-jpeg: Correct return type for mem2mem buffer helpers
	media: rockchip/rga: Correct return type for mem2mem buffer helpers
	media: s5p-g2d: Correct return type for mem2mem buffer helpers
	media: mx2_emmaprp: Correct return type for mem2mem buffer helpers
	media: mtk-jpeg: Correct return type for mem2mem buffer helpers
	mt76: usb: do not run mt76u_queues_deinit twice
	xen/gntdev: Do not destroy context while dma-bufs are in use
	vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
	HID: intel-ish-hid: avoid binding wrong ishtp_cl_device
	cgroup, rstat: Don't flush subtree root unless necessary
	jbd2: fix race when writing superblock
	leds: lp55xx: fix null deref on firmware load failure
	perf report: Add s390 diagnosic sampling descriptor size
	iwlwifi: pcie: fix emergency path
	ACPI / video: Refactor and fix dmi_is_desktop()
	selftests: skip seccomp get_metadata test if not real root
	kprobes: Prohibit probing on bsearch()
	kprobes: Prohibit probing on RCU debug routine
	netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm
	ARM: 8833/1: Ensure that NEON code always compiles with Clang
	ARM: dts: meson8b: fix the Ethernet data line signals in eth_rgmii_pins
	ALSA: PCM: check if ops are defined before suspending PCM
	ath10k: fix shadow register implementation for WCN3990
	usb: f_fs: Avoid crash due to out-of-scope stack ptr access
	sched/topology: Fix percpu data types in struct sd_data & struct s_data
	bcache: fix input overflow to cache set sysfs file io_error_halflife
	bcache: fix input overflow to sequential_cutoff
	bcache: fix potential div-zero error of writeback_rate_i_term_inverse
	bcache: improve sysfs_strtoul_clamp()
	genirq: Avoid summation loops for /proc/stat
	net: marvell: mvpp2: fix stuck in-band SGMII negotiation
	iw_cxgb4: fix srqidx leak during connection abort
	net: phy: consider latched link-down status in polling mode
	fbdev: fbmem: fix memory access if logo is bigger than the screen
	cdrom: Fix race condition in cdrom_sysctl_register
	drm: rcar-du: add missing of_node_put
	drm/amd/display: Don't re-program planes for DPMS changes
	drm/amd/display: Disconnect mpcc when changing tg
	perf/aux: Make perf_event accessible to setup_aux()
	e1000e: fix cyclic resets at link up with active tx
	e1000e: Exclude device from suspend direct complete optimization
	platform/x86: intel_pmc_core: Fix PCH IP sts reading
	i2c: of: Try to find an I2C adapter matching the parent
	staging: spi: mt7621: Add return code check on device_reset()
	iwlwifi: mvm: fix RFH config command with >=10 CPUs
	ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe
	sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
	efi/memattr: Don't bail on zero VA if it equals the region's PA
	sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
	drm/vkms: Bugfix extra vblank frame
	ARM: dts: lpc32xx: Remove leading 0x and 0s from bindings notation
	efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted
	soc: qcom: gsbi: Fix error handling in gsbi_probe()
	mt7601u: bump supported EEPROM version
	ARM: 8830/1: NOMMU: Toggle only bits in EXC_RETURN we are really care of
	ARM: avoid Cortex-A9 livelock on tight dmb loops
	block, bfq: fix in-service-queue check for queue merging
	bpf: fix missing prototype warnings
	selftests/bpf: skip verifier tests for unsupported program types
	powerpc/64s: Clear on-stack exception marker upon exception return
	cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
	backlight: pwm_bl: Use gpiod_get_value_cansleep() to get initial state
	tty: increase the default flip buffer limit to 2*640K
	powerpc/pseries: Perform full re-add of CPU for topology update post-migration
	drm/amd/display: Enable vblank interrupt during CRC capture
	ALSA: dice: add support for Solid State Logic Duende Classic/Mini
	usb: dwc3: gadget: Fix OTG events when gadget driver isn't loaded
	platform/x86: intel-hid: Missing power button release on some Dell models
	perf script python: Use PyBytes for attr in trace-event-python
	perf script python: Add trace_context extension module to sys.modules
	media: mt9m111: set initial frame size other than 0x0
	hwrng: virtio - Avoid repeated init of completion
	soc/tegra: fuse: Fix illegal free of IO base address
	HID: intel-ish: ipc: handle PIMR before ish_wakeup also clear PISR busy_clear bit
	f2fs: UBSAN: set boolean value iostat_enable correctly
	hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable
	cpu/hotplug: Mute hotplug lockdep during init
	dmaengine: imx-dma: fix warning comparison of distinct pointer types
	dmaengine: qcom_hidma: assign channel cookie correctly
	dmaengine: qcom_hidma: initialize tx flags in hidma_prep_dma_*
	netfilter: physdev: relax br_netfilter dependency
	media: rcar-vin: Allow independent VIN link enablement
	media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration
	regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting
	pinctrl: meson: meson8b: add the eth_rxd2 and eth_rxd3 pins
	drm: Auto-set allow_fb_modifiers when given modifiers at plane init
	drm/nouveau: Stop using drm_crtc_force_disable
	x86/build: Specify elf_i386 linker emulation explicitly for i386 objects
	selinux: do not override context on context mounts
	brcmfmac: Use firmware_request_nowarn for the clm_blob
	wlcore: Fix memory leak in case wl12xx_fetch_firmware failure
	x86/build: Mark per-CPU symbols as absolute explicitly for LLD
	drm/fb-helper: fix leaks in error path of drm_fb_helper_fbdev_setup
	clk: meson: clean-up clock registration
	clk: rockchip: fix frac settings of GPLL clock for rk3328
	dmaengine: tegra: avoid overflow of byte tracking
	Input: soc_button_array - fix mapping of the 5th GPIO in a PNP0C40 device
	drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers
	net: stmmac: Avoid one more sometimes uninitialized Clang warning
	ACPI / video: Extend chassis-type detection with a "Lunch Box" check
	bcache: fix potential div-zero error of writeback_rate_p_term_inverse
	kprobes/x86: Blacklist non-attachable interrupt functions
	Linux 4.19.34

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-04-05 22:43:09 +02:00
Qian Cai
a6c56bf63e page_poison: play nicely with KASAN
[ Upstream commit 4117992df66a26fa33908b4969e04801534baab1 ]

KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
It triggers false positives in the allocation path:

  BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
  Read of size 8 at addr ffff88881f800000 by task swapper/0
  CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
  Call Trace:
   dump_stack+0xe0/0x19a
   print_address_description.cold.2+0x9/0x28b
   kasan_report.cold.3+0x7a/0xb5
   __asan_report_load8_noabort+0x19/0x20
   memchr_inv+0x2ea/0x330
   kernel_poison_pages+0x103/0x3d5
   get_page_from_freelist+0x15e7/0x4d90

because KASAN has not yet unpoisoned the shadow page for allocation
before it checks memchr_inv() but only found a stale poison pattern.

Also, false positives in free path,

  BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
  Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
  CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
  Call Trace:
   dump_stack+0xe0/0x19a
   print_address_description.cold.2+0x9/0x28b
   kasan_report.cold.3+0x7a/0xb5
   check_memory_region+0x22d/0x250
   memset+0x28/0x40
   kernel_poison_pages+0x29e/0x3d5
   __free_pages_ok+0x75f/0x13e0

due to KASAN adds poisoned redzones around slab objects, but the page
poisoning needs to poison the whole page.

Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-04-05 22:32:59 +02:00
Greg Kroah-Hartman
bb418a146a Merge 4.19.31 into android-4.19
Changes in 4.19.31
	media: videobuf2-v4l2: drop WARN_ON in vb2_warn_zero_bytesused()
	9p: use inode->i_lock to protect i_size_write() under 32-bit
	9p/net: fix memory leak in p9_client_create
	ASoC: fsl_esai: fix register setting issue in RIGHT_J mode
	ASoC: codecs: pcm186x: fix wrong usage of DECLARE_TLV_DB_SCALE()
	ASoC: codecs: pcm186x: Fix energysense SLEEP bit
	iio: adc: exynos-adc: Fix NULL pointer exception on unbind
	mei: hbm: clean the feature flags on link reset
	mei: bus: move hw module get/put to probe/release
	stm class: Fix an endless loop in channel allocation
	crypto: caam - fix hash context DMA unmap size
	crypto: ccree - fix missing break in switch statement
	crypto: caam - fixed handling of sg list
	crypto: caam - fix DMA mapping of stack memory
	crypto: ccree - fix free of unallocated mlli buffer
	crypto: ccree - unmap buffer before copying IV
	crypto: ccree - don't copy zero size ciphertext
	crypto: cfb - add missing 'chunksize' property
	crypto: cfb - remove bogus memcpy() with src == dest
	crypto: ahash - fix another early termination in hash walk
	crypto: rockchip - fix scatterlist nents error
	crypto: rockchip - update new iv to device in multiple operations
	drm/imx: ignore plane updates on disabled crtcs
	gpu: ipu-v3: Fix i.MX51 CSI control registers offset
	drm/imx: imx-ldb: add missing of_node_puts
	gpu: ipu-v3: Fix CSI offsets for imx53
	ASoC: rt5682: Correct the setting while select ASRC clk for AD/DA filter
	clocksource: timer-ti-dm: Fix pwm dmtimer usage of fck reparenting
	KVM: arm/arm64: vgic: Make vgic_dist->lpi_list_lock a raw_spinlock
	arm64: dts: rockchip: fix graph_port warning on rk3399 bob kevin and excavator
	s390/dasd: fix using offset into zero size array error
	Input: pwm-vibra - prevent unbalanced regulator
	Input: pwm-vibra - stop regulator after disabling pwm, not before
	ARM: dts: Configure clock parent for pwm vibra
	ARM: OMAP2+: Variable "reg" in function omap4_dsi_mux_pads() could be uninitialized
	ASoC: dapm: fix out-of-bounds accesses to DAPM lookup tables
	ASoC: rsnd: fixup rsnd_ssi_master_clk_start() user count check
	KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded
	arm/arm64: KVM: Allow a VCPU to fully reset itself
	arm/arm64: KVM: Don't panic on failure to properly reset system registers
	KVM: arm/arm64: vgic: Always initialize the group of private IRQs
	KVM: arm64: Forbid kprobing of the VHE world-switch code
	ASoC: samsung: Prevent clk_get_rate() calls in atomic context
	ARM: OMAP2+: fix lack of timer interrupts on CPU1 after hotplug
	Input: cap11xx - switch to using set_brightness_blocking()
	Input: ps2-gpio - flush TX work when closing port
	Input: matrix_keypad - use flush_delayed_work()
	mac80211: call drv_ibss_join() on restart
	mac80211: Fix Tx aggregation session tear down with ITXQs
	netfilter: compat: initialize all fields in xt_init
	blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue
	ipvs: fix dependency on nf_defrag_ipv6
	floppy: check_events callback should not return a negative number
	xprtrdma: Make sure Send CQ is allocated on an existing compvec
	NFS: Don't use page_file_mapping after removing the page
	mm/gup: fix gup_pmd_range() for dax
	Revert "mm: use early_pfn_to_nid in page_ext_init"
	scsi: qla2xxx: Fix panic from use after free in qla2x00_async_tm_cmd
	net: dsa: bcm_sf2: potential array overflow in bcm_sf2_sw_suspend()
	x86/CPU: Add Icelake model number
	mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
	net: hns: Fix object reference leaks in hns_dsaf_roce_reset()
	i2c: cadence: Fix the hold bit setting
	i2c: bcm2835: Clear current buffer pointers and counts after a transfer
	auxdisplay: ht16k33: fix potential user-after-free on module unload
	Input: st-keyscan - fix potential zalloc NULL dereference
	clk: sunxi-ng: v3s: Fix TCON reset de-assert bit
	kallsyms: Handle too long symbols in kallsyms.c
	clk: sunxi: A31: Fix wrong AHB gate number
	esp: Skip TX bytes accounting when sending from a request socket
	ARM: 8824/1: fix a migrating irq bug when hotplug cpu
	bpf: only adjust gso_size on bytestream protocols
	bpf: fix lockdep false positive in stackmap
	af_key: unconditionally clone on broadcast
	ARM: 8835/1: dma-mapping: Clear DMA ops on teardown
	assoc_array: Fix shortcut creation
	keys: Fix dependency loop between construction record and auth key
	scsi: libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task
	net: systemport: Fix reception of BPDUs
	net: dsa: bcm_sf2: Do not assume DSA master supports WoL
	pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins
	qmi_wwan: apply SET_DTR quirk to Sierra WP7607
	net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()
	xfrm: Fix inbound traffic via XFRM interfaces across network namespaces
	mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue
	ASoC: topology: free created components in tplg load error
	qed: Fix iWARP buffer size provided for syn packet processing.
	qed: Fix iWARP syn packet mac address validation.
	ARM: dts: armada-xp: fix Armada XP boards NAND description
	arm64: Relax GIC version check during early boot
	ARM: tegra: Restore DT ABI on Tegra124 Chromebooks
	net: marvell: mvneta: fix DMA debug warning
	mm: handle lru_add_drain_all for UP properly
	tmpfs: fix link accounting when a tmpfile is linked in
	ixgbe: fix older devices that do not support IXGBE_MRQC_L3L4TXSWEN
	ARCv2: lib: memcpy: fix doing prefetchw outside of buffer
	ARC: uacces: remove lp_start, lp_end from clobber list
	ARCv2: support manual regfile save on interrupts
	ARCv2: don't assume core 0x54 has dual issue
	phonet: fix building with clang
	mac80211_hwsim: propagate genlmsg_reply return code
	bpf, lpm: fix lookup bug in map_delete_elem
	net: thunderx: make CFG_DONE message to run through generic send-ack sequence
	net: thunderx: add nicvf_send_msg_to_pf result check for set_rx_mode_task
	nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K
	nfp: bpf: fix ALU32 high bits clearance bug
	bnxt_en: Fix typo in firmware message timeout logic.
	bnxt_en: Wait longer for the firmware message response to complete.
	net: set static variable an initial value in atl2_probe()
	selftests: fib_tests: sleep after changing carrier. again.
	tmpfs: fix uninitialized return value in shmem_link
	stm class: Prevent division by zero
	nfit: acpi_nfit_ctl(): Check out_obj->type in the right place
	acpi/nfit: Fix bus command validation
	nfit/ars: Attempt a short-ARS whenever the ARS state is idle at boot
	nfit/ars: Attempt short-ARS even in the no_init_ars case
	libnvdimm/label: Clear 'updating' flag after label-set update
	libnvdimm, pfn: Fix over-trim in trim_pfn_device()
	libnvdimm/pmem: Honor force_raw for legacy pmem regions
	libnvdimm: Fix altmap reservation size calculation
	fix cgroup_do_mount() handling of failure exits
	crypto: aead - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
	crypto: aegis - fix handling chunked inputs
	crypto: arm/crct10dif - revert to C code for short inputs
	crypto: arm64/aes-neonbs - fix returning final keystream block
	crypto: arm64/crct10dif - revert to C code for short inputs
	crypto: hash - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
	crypto: morus - fix handling chunked inputs
	crypto: pcbc - remove bogus memcpy()s with src == dest
	crypto: skcipher - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
	crypto: testmgr - skip crc32c context test for ahash algorithms
	crypto: x86/aegis - fix handling chunked inputs and MAY_SLEEP
	crypto: x86/aesni-gcm - fix crash on empty plaintext
	crypto: x86/morus - fix handling chunked inputs and MAY_SLEEP
	crypto: arm64/aes-ccm - fix logical bug in AAD MAC handling
	crypto: arm64/aes-ccm - fix bugs in non-NEON fallback routine
	CIFS: Do not reset lease state to NONE on lease break
	CIFS: Do not skip SMB2 message IDs on send failures
	CIFS: Fix read after write for files with read caching
	tracing: Use strncpy instead of memcpy for string keys in hist triggers
	tracing: Do not free iter->trace in fail path of tracing_open_pipe()
	tracing/perf: Use strndup_user() instead of buggy open-coded version
	xen: fix dom0 boot on huge systems
	ACPI / device_sysfs: Avoid OF modalias creation for removed device
	mmc: sdhci-esdhc-imx: fix HS400 timing issue
	mmc:fix a bug when max_discard is 0
	netfilter: ipt_CLUSTERIP: fix warning unused variable cn
	spi: ti-qspi: Fix mmap read when more than one CS in use
	spi: pxa2xx: Setup maximum supported DMA transfer length
	regulator: s2mps11: Fix steps for buck7, buck8 and LDO35
	regulator: max77620: Initialize values for DT properties
	regulator: s2mpa01: Fix step values for some LDOs
	clocksource/drivers/exynos_mct: Move one-shot check from tick clear to ISR
	clocksource/drivers/exynos_mct: Clear timer interrupt when shutdown
	clocksource/drivers/arch_timer: Workaround for Allwinner A64 timer instability
	s390/setup: fix early warning messages
	s390/virtio: handle find on invalid queue gracefully
	scsi: virtio_scsi: don't send sc payload with tmfs
	scsi: aacraid: Fix performance issue on logical drives
	scsi: sd: Optimal I/O size should be a multiple of physical block size
	scsi: target/iscsi: Avoid iscsit_release_commands_from_conn() deadlock
	scsi: qla2xxx: Fix LUN discovery if loop id is not assigned yet by firmware
	fs/devpts: always delete dcache dentry-s in dput()
	splice: don't merge into linked buffers
	ovl: During copy up, first copy up data and then xattrs
	ovl: Do not lose security.capability xattr over metadata file copy-up
	m68k: Add -ffreestanding to CFLAGS
	Btrfs: setup a nofs context for memory allocation at btrfs_create_tree()
	Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl
	btrfs: ensure that a DUP or RAID1 block group has exactly two stripes
	Btrfs: fix corruption reading shared and compressed extents after hole punching
	soc: qcom: rpmh: Avoid accessing freed memory from batch API
	libertas_tf: don't set URB_ZERO_PACKET on IN USB transfer
	irqchip/gic-v3-its: Avoid parsing _indirect_ twice for Device table
	irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code
	x86/kprobes: Prohibit probing on optprobe template code
	cpufreq: kryo: Release OPP tables on module removal
	cpufreq: tegra124: add missing of_node_put()
	cpufreq: pxa2xx: remove incorrect __init annotation
	ext4: fix check of inode in swap_inode_boot_loader
	ext4: cleanup pagecache before swap i_data
	ext4: update quota information while swapping boot loader inode
	ext4: add mask of ext4 flags to swap
	ext4: fix crash during online resizing
	PCI/ASPM: Use LTR if already enabled by platform
	PCI/DPC: Fix print AER status in DPC event handling
	PCI: dwc: skip MSI init if MSIs have been explicitly disabled
	IB/hfi1: Close race condition on user context disable and close
	cxl: Wrap iterations over afu slices inside 'afu_list_lock'
	ext2: Fix underflow in ext2_max_size()
	clk: uniphier: Fix update register for CPU-gear
	clk: clk-twl6040: Fix imprecise external abort for pdmclk
	clk: samsung: exynos5: Fix possible NULL pointer exception on platform_device_alloc() failure
	clk: samsung: exynos5: Fix kfree() of const memory on setting driver_override
	clk: ingenic: Fix round_rate misbehaving with non-integer dividers
	clk: ingenic: Fix doc of ingenic_cgu_div_info
	usb: chipidea: tegra: Fix missed ci_hdrc_remove_device()
	usb: typec: tps6598x: handle block writes separately with plain-I2C adapters
	dmaengine: usb-dmac: Make DMAC system sleep callbacks explicit
	mm: hwpoison: fix thp split handing in soft_offline_in_use_page()
	mm/vmalloc: fix size check for remap_vmalloc_range_partial()
	mm/memory.c: do_fault: avoid usage of stale vm_area_struct
	kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv
	device property: Fix the length used in PROPERTY_ENTRY_STRING()
	intel_th: Don't reference unassigned outputs
	parport_pc: fix find_superio io compare code, should use equal test.
	i2c: tegra: fix maximum transfer size
	media: i2c: ov5640: Fix post-reset delay
	gpio: pca953x: Fix dereference of irq data in shutdown
	can: flexcan: FLEXCAN_IFLAG_MB: add () around macro argument
	drm/i915: Relax mmap VMA check
	bpf: only test gso type on gso packets
	serial: uartps: Fix stuck ISR if RX disabled with non-empty FIFO
	serial: 8250_of: assume reg-shift of 2 for mrvl,mmp-uart
	serial: 8250_pci: Fix number of ports for ACCES serial cards
	serial: 8250_pci: Have ACCES cards that use the four port Pericom PI7C9X7954 chip use the pci_pericom_setup()
	jbd2: clear dirty flag when revoking a buffer from an older transaction
	jbd2: fix compile warning when using JBUFFER_TRACE
	selinux: add the missing walk_size + len check in selinux_sctp_bind_connect
	security/selinux: fix SECURITY_LSM_NATIVE_LABELS on reused superblock
	powerpc/32: Clear on-stack exception marker upon exception return
	powerpc/wii: properly disable use of BATs when requested.
	powerpc/powernv: Make opal log only readable by root
	powerpc/83xx: Also save/restore SPRG4-7 during suspend
	powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exit
	powerpc: Fix 32-bit KVM-PR lockup and host crash with MacOS guest
	powerpc/ptrace: Simplify vr_get/set() to avoid GCC warning
	powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration
	powerpc/traps: fix recoverability of machine check handling on book3s/32
	powerpc/traps: Fix the message printed when stack overflows
	ARM: s3c24xx: Fix boolean expressions in osiris_dvs_notify
	arm64: Fix HCR.TGE status for NMI contexts
	arm64: debug: Ensure debug handlers check triggering exception level
	arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
	ipmi_si: fix use-after-free of resource->name
	dm: fix to_sector() for 32bit
	dm integrity: limit the rate of error messages
	mfd: sm501: Fix potential NULL pointer dereference
	cpcap-charger: generate events for userspace
	NFS: Fix I/O request leakages
	NFS: Fix an I/O request leakage in nfs_do_recoalesce
	NFS: Don't recoalesce on error in nfs_pageio_complete_mirror()
	nfsd: fix performance-limiting session calculation
	nfsd: fix memory corruption caused by readdir
	nfsd: fix wrong check in write_v4_end_grace()
	NFSv4.1: Reinitialise sequence results before retransmitting a request
	svcrpc: fix UDP on servers with lots of threads
	PM / wakeup: Rework wakeup source timer cancellation
	bcache: never writeback a discard operation
	stable-kernel-rules.rst: add link to networking patch queue
	vt: perform safe console erase in the right order
	x86/unwind/orc: Fix ORC unwind table alignment
	perf intel-pt: Fix CYC timestamp calculation after OVF
	perf tools: Fix split_kallsyms_for_kcore() for trampoline symbols
	perf auxtrace: Define auxtrace record alignment
	perf intel-pt: Fix overlap calculation for padding
	perf/x86/intel/uncore: Fix client IMC events return huge result
	perf intel-pt: Fix divide by zero when TSC is not available
	md: Fix failed allocation of md_register_thread
	tpm/tpm_crb: Avoid unaligned reads in crb_recv()
	tpm: Unify the send callback behaviour
	rcu: Do RCU GP kthread self-wakeup from softirq and interrupt
	media: imx: prpencvf: Stop upstream before disabling IDMA channel
	media: lgdt330x: fix lock status reporting
	media: uvcvideo: Avoid NULL pointer dereference at the end of streaming
	media: vimc: Add vimc-streamer for stream control
	media: imx: csi: Disable CSI immediately after last EOF
	media: imx: csi: Stop upstream before disabling IDMA channel
	drm/fb-helper: generic: Fix drm_fbdev_client_restore()
	drm/radeon/evergreen_cs: fix missing break in switch statement
	drm/amd/powerplay: correct power reading on fiji
	drm/amd/display: don't call dm_pp_ function from an fpu block
	KVM: Call kvm_arch_memslots_updated() before updating memslots
	KVM: x86/mmu: Detect MMIO generation wrap in any address space
	KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux
	KVM: nVMX: Sign extend displacements of VMX instr's mem operands
	KVM: nVMX: Apply addr size mask to effective address for VMX instructions
	KVM: nVMX: Ignore limit checks on VMX instructions using flat segments
	bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata
	s390/setup: fix boot crash for machine without EDAT-1
	Linux 4.19.31

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-03-23 21:13:30 +01:00
Jann Horn
33e83ea302 mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
[ Upstream commit 2c2ade81741c66082f8211f0b96cf509cc4c0218 ]

The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
number of references that we might need to create in the fastpath later,
the bump-allocation fastpath only has to modify the non-atomic bias value
that tracks the number of extra references we hold instead of the atomic
refcount. The maximum number of allocations we can serve (under the
assumption that no allocation is made with size 0) is nc->size, so that's
the bias used.

However, even when all memory in the allocation has been given away, a
reference to the page is still held; and in the `offset < 0` slowpath, the
page may be reused if everyone else has dropped their references.
This means that the necessary number of references is actually
`nc->size+1`.

Luckily, from a quick grep, it looks like the only path that can call
page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
requires CAP_NET_ADMIN in the init namespace and is only intended to be
used for kernel testing and fuzzing.

To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
`offset < 0` path, below the virt_to_page() call, and then repeatedly call
writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
with a vector consisting of 15 elements containing 1 byte each.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:46 +01:00
Johannes Weiner
e550f94252 UPSTREAM: psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id00d23c977169b0c4636d92016fc1fee0274be05
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:25:27 -07:00
Greg Kroah-Hartman
6e0411bdc2 Merge 4.19.21 into android-4.19
Changes in 4.19.21
	devres: Align data[] to ARCH_KMALLOC_MINALIGN
	drm/bufs: Fix Spectre v1 vulnerability
	staging: iio: adc: ad7280a: handle error from __ad7280_read32()
	drm/vgem: Fix vgem_init to get drm device available.
	pinctrl: bcm2835: Use raw spinlock for RT compatibility
	ASoC: Intel: mrfld: fix uninitialized variable access
	gpiolib: Fix possible use after free on label
	drm/sun4i: Initialize registers in tcon-top driver
	genirq/affinity: Spread IRQs to all available NUMA nodes
	gpu: ipu-v3: image-convert: Prevent race between run and unprepare
	nds32: Fix gcc 8.0 compiler option incompatible.
	wil6210: fix reset flow for Talyn-mb
	wil6210: fix memory leak in wil_find_tx_bcast_2
	ath10k: assign 'n_cipher_suites' for WCN3990
	ath9k: dynack: use authentication messages for 'late' ack
	scsi: lpfc: Correct LCB RJT handling
	scsi: mpt3sas: Call sas_remove_host before removing the target devices
	scsi: lpfc: Fix LOGO/PLOGI handling when triggerd by ABTS Timeout event
	ARM: 8808/1: kexec:offline panic_smp_self_stop CPU
	clk: boston: fix possible memory leak in clk_boston_setup()
	dlm: Don't swamp the CPU with callbacks queued during recovery
	x86/PCI: Fix Broadcom CNB20LE unintended sign extension (redux)
	powerpc/pseries: add of_node_put() in dlpar_detach_node()
	crypto: aes_ti - disable interrupts while accessing S-box
	drm/vc4: ->x_scaling[1] should never be set to VC4_SCALING_NONE
	serial: fsl_lpuart: clear parity enable bit when disable parity
	ptp: check gettime64 return code in PTP_SYS_OFFSET ioctl
	MIPS: Boston: Disable EG20T prefetch
	dpaa2-ptp: defer probe when portal allocation failed
	iwlwifi: fw: do not set sgi bits for HE connection
	staging:iio:ad2s90: Make probe handle spi_setup failure
	fpga: altera-cvp: Fix registration for CvP incapable devices
	Tools: hv: kvp: Fix a warning of buffer overflow with gcc 8.0.1
	fpga: altera-cvp: fix 'bad IO access' on x86_64
	vbox: fix link error with 'gcc -Og'
	platform/chrome: don't report EC_MKBP_EVENT_SENSOR_FIFO as wakeup
	i40e: prevent overlapping tx_timeout recover
	scsi: hisi_sas: change the time of SAS SSP connection
	staging: iio: ad7780: update voltage on read
	usbnet: smsc95xx: fix rx packet alignment
	drm/rockchip: fix for mailbox read size
	ARM: OMAP2+: hwmod: Fix some section annotations
	drm/amd/display: fix gamma not being applied correctly
	drm/amd/display: calculate stream->phy_pix_clk before clock mapping
	bpf: libbpf: retry map creation without the name
	net/mlx5: EQ, Use the right place to store/read IRQ affinity hint
	modpost: validate symbol names also in find_elf_symbol
	perf tools: Add Hygon Dhyana support
	soc/tegra: Don't leak device tree node reference
	media: rc: ensure close() is called on rc_unregister_device
	media: video-i2c: avoid accessing released memory area when removing driver
	media: mtk-vcodec: Release device nodes in mtk_vcodec_init_enc_pm()
	staging: erofs: fix the definition of DBG_BUGON
	clk: meson: meson8b: do not use cpu_div3 for cpu_scale_out_sel
	clk: meson: meson8b: fix the width of the cpu_scale_div clock
	clk: meson: meson8b: mark the CPU clock as CLK_IS_CRITICAL
	ptp: Fix pass zero to ERR_PTR() in ptp_clock_register
	dmaengine: xilinx_dma: Remove __aligned attribute on zynqmp_dma_desc_ll
	powerpc/32: Add .data..Lubsan_data*/.data..Lubsan_type* sections explicitly
	iio: adc: meson-saradc: check for devm_kasprintf failure
	iio: adc: meson-saradc: fix internal clock names
	iio: accel: kxcjk1013: Add KIOX010A ACPI Hardware-ID
	media: adv*/tc358743/ths8200: fill in min width/height/pixelclock
	ACPI: SPCR: Consider baud rate 0 as preconfigured state
	staging: pi433: fix potential null dereference
	f2fs: move dir data flush to write checkpoint process
	f2fs: fix race between write_checkpoint and write_begin
	f2fs: fix wrong return value of f2fs_acl_create
	i2c: sh_mobile: add support for r8a77990 (R-Car E3)
	arm64: io: Ensure calls to delay routines are ordered against prior readX()
	net: aquantia: return 'err' if set MPI_DEINIT state fails
	sunvdc: Do not spin in an infinite loop when vio_ldc_send() returns EAGAIN
	soc: bcm: brcmstb: Don't leak device tree node reference
	nfsd4: fix crash on writing v4_end_grace before nfsd startup
	drm: Clear state->acquire_ctx before leaving drm_atomic_helper_commit_duplicated_state()
	perf: arm_spe: handle devm_kasprintf() failure
	arm64: io: Ensure value passed to __iormb() is held in a 64-bit register
	Thermal: do not clear passive state during system sleep
	thermal: Fix locking in cooling device sysfs update cur_state
	firmware/efi: Add NULL pointer checks in efivars API functions
	s390/zcrypt: improve special ap message cmd handling
	mt76x0: dfs: fix IBI_R11 configuration on non-radar channels
	arm64: ftrace: don't adjust the LR value
	drm/v3d: Fix prime imports of buffers from other drivers.
	ARM: dts: mmp2: fix TWSI2
	ARM: dts: aspeed: add missing memory unit-address
	x86/fpu: Add might_fault() to user_insn()
	media: i2c: TDA1997x: select CONFIG_HDMI
	media: DaVinci-VPBE: fix error handling in vpbe_initialize()
	smack: fix access permissions for keyring
	xtensa: xtfpga.dtsi: fix dtc warnings about SPI
	usb: dwc3: Correct the logic for checking TRB full in __dwc3_prepare_one_trb()
	usb: dwc2: Disable power down feature on Samsung SoCs
	usb: hub: delay hub autosuspend if USB3 port is still link training
	timekeeping: Use proper seqcount initializer
	usb: mtu3: fix the issue about SetFeature(U1/U2_Enable)
	clk: sunxi-ng: a33: Set CLK_SET_RATE_PARENT for all audio module clocks
	media: imx274: select REGMAP_I2C
	drm/amdgpu/powerplay: fix clock stretcher limits on polaris (v2)
	tipc: fix node keep alive interval calculation
	driver core: Move async_synchronize_full call
	kobject: return error code if writing /sys/.../uevent fails
	IB/hfi1: Unreserve a reserved request when it is completed
	usb: dwc3: trace: add missing break statement to make compiler happy
	gpio: mt7621: report failure of devm_kasprintf()
	gpio: mt7621: pass mediatek_gpio_bank_probe() failure up the stack
	pinctrl: sx150x: handle failure case of devm_kstrdup
	iommu/amd: Fix amd_iommu=force_isolation
	ARM: dts: Fix OMAP4430 SDP Ethernet startup
	mips: bpf: fix encoding bug for mm_srlv32_op
	media: coda: fix H.264 deblocking filter controls
	ARM: dts: Fix up the D-Link DIR-685 MTD partition info
	watchdog: renesas_wdt: don't set divider while watchdog is running
	ARM: dts: imx51-zii-rdu1: Do not specify "power-gpio" for hpa1
	usb: dwc3: gadget: Disable CSP for stream OUT ep
	iommu/arm-smmu-v3: Avoid memory corruption from Hisilicon MSI payloads
	iommu/arm-smmu: Add support for qcom,smmu-v2 variant
	iommu/arm-smmu-v3: Use explicit mb() when moving cons pointer
	sata_rcar: fix deferred probing
	clk: imx6sl: ensure MMDC CH0 handshake is bypassed
	platform/x86: mlx-platform: Fix tachometer registers
	cpuidle: big.LITTLE: fix refcount leak
	OPP: Use opp_table->regulators to verify no regulator case
	tee: optee: avoid possible double list_del()
	drm/msm/dsi: fix dsi clock names in DSI 10nm PLL driver
	drm/msm: dpu: Only check flush register against pending flushes
	lightnvm: pblk: fix resubmission of overwritten write err lbas
	lightnvm: pblk: add lock protection to list operations
	i2c-axxia: check for error conditions first
	phy: sun4i-usb: add support for missing USB PHY index
	mlxsw: spectrum_acl: Limit priority value
	udf: Fix BUG on corrupted inode
	switchtec: Fix SWITCHTEC_IOCTL_EVENT_IDX_ALL flags overwrite
	selftests/bpf: use __bpf_constant_htons in test_prog.c
	ARM: pxa: avoid section mismatch warning
	ASoC: fsl: Fix SND_SOC_EUKREA_TLV320 build error on i.MX8M
	KVM: PPC: Book3S: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines
	mmc: bcm2835: Recover from MMC_SEND_EXT_CSD
	mmc: bcm2835: reset host on timeout
	mmc: meson-mx-sdio: check devm_kasprintf for failure
	memstick: Prevent memstick host from getting runtime suspended during card detection
	mmc: sdhci-of-esdhc: Fix timeout checks
	mmc: sdhci-omap: Fix timeout checks
	mmc: sdhci-xenon: Fix timeout checks
	mmc: jz4740: Get CD/WP GPIOs from descriptors
	usb: renesas_usbhs: add support for RZ/G2E
	btrfs: harden agaist duplicate fsid on scanned devices
	serial: sh-sci: Fix locking in sci_submit_rx()
	serial: sh-sci: Resume PIO in sci_rx_interrupt() on DMA failure
	tty: serial: samsung: Properly set flags in autoCTS mode
	perf test: Fix perf_event_attr test failure
	perf dso: Fix unchecked usage of strncpy()
	perf header: Fix unchecked usage of strncpy()
	btrfs: use tagged writepage to mitigate livelock of snapshot
	perf probe: Fix unchecked usage of strncpy()
	i2c: sh_mobile: Add support for r8a774c0 (RZ/G2E)
	bnxt_en: Disable MSIX before re-reserving NQs/CMPL rings.
	tools/power/x86/intel_pstate_tracer: Fix non root execution for post processing a trace file
	livepatch: check kzalloc return values
	arm64: KVM: Skip MMIO insn after emulation
	usb: musb: dsps: fix otg state machine
	usb: musb: dsps: fix runtime pm for peripheral mode
	perf header: Fix up argument to ctime()
	perf tools: Cast off_t to s64 to avoid warning on bionic libc
	percpu: convert spin_lock_irq to spin_lock_irqsave.
	net: hns3: fix incomplete uninitialization of IRQ in the hns3_nic_uninit_vector_data()
	drm/amd/display: Add retry to read ddc_clock pin
	Bluetooth: hci_bcm: Handle deferred probing for the clock supply
	drm/amd/display: fix YCbCr420 blank color
	powerpc/uaccess: fix warning/error with access_ok()
	mac80211: fix radiotap vendor presence bitmap handling
	xfrm6_tunnel: Fix spi check in __xfrm6_tunnel_alloc_spi
	mlxsw: spectrum: Properly cleanup LAG uppers when removing port from LAG
	scsi: smartpqi: correct host serial num for ssa
	scsi: smartpqi: correct volume status
	scsi: smartpqi: increase fw status register read timeout
	cw1200: Fix concurrency use-after-free bugs in cw1200_hw_scan()
	net: hns3: add max vector number check for pf
	powerpc/perf: Fix thresholding counter data for unknown type
	iwlwifi: mvm: fix setting HE ppe FW config
	powerpc/powernv/ioda: Allocate indirect TCE levels of cached userspace addresses on demand
	mlx5: update timecounter at least twice per counter overflow
	drbd: narrow rcu_read_lock in drbd_sync_handshake
	drbd: disconnect, if the wrong UUIDs are attached on a connected peer
	drbd: skip spurious timeout (ping-timeo) when failing promote
	drbd: Avoid Clang warning about pointless switch statment
	drm/amd/display: validate extended dongle caps
	video: clps711x-fb: release disp device node in probe()
	md: fix raid10 hang issue caused by barrier
	fbdev: fbmem: behave better with small rotated displays and many CPUs
	i40e: define proper net_device::neigh_priv_len
	ice: Do not enable NAPI on q_vectors that have no rings
	igb: Fix an issue that PME is not enabled during runtime suspend
	ACPI/APEI: Clear GHES block_status before panic()
	fbdev: fbcon: Fix unregister crash when more than one framebuffer
	powerpc/mm: Fix reporting of kernel execute faults on the 8xx
	pinctrl: meson: meson8: fix the GPIO function for the GPIOAO pins
	pinctrl: meson: meson8b: fix the GPIO function for the GPIOAO pins
	KVM: x86: svm: report MSR_IA32_MCG_EXT_CTL as unsupported
	powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.
	kvm: Change offset in kvm_write_guest_offset_cached to unsigned
	NFS: nfs_compare_mount_options always compare auth flavors.
	perf build: Don't unconditionally link the libbfd feature test to -liberty and -lz
	hwmon: (lm80) fix a missing check of the status of SMBus read
	hwmon: (lm80) fix a missing check of bus read in lm80 probe
	seq_buf: Make seq_buf_puts() null-terminate the buffer
	crypto: ux500 - Use proper enum in cryp_set_dma_transfer
	crypto: ux500 - Use proper enum in hash_set_dma_transfer
	MIPS: ralink: Select CONFIG_CPU_MIPSR2_IRQ_VI on MT7620/8
	cifs: check ntwrk_buf_start for NULL before dereferencing it
	f2fs: fix use-after-free issue when accessing sbi->stat_info
	um: Avoid marking pages with "changed protection"
	niu: fix missing checks of niu_pci_eeprom_read
	f2fs: fix sbi->extent_list corruption issue
	cgroup: fix parsing empty mount option string
	perf python: Do not force closing original perf descriptor in evlist.get_pollfd()
	scripts/decode_stacktrace: only strip base path when a prefix of the path
	arch/sh/boards/mach-kfr2r09/setup.c: fix struct mtd_oob_ops build warning
	ocfs2: don't clear bh uptodate for block read
	ocfs2: improve ocfs2 Makefile
	mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
	zram: fix lockdep warning of free block handling
	isdn: hisax: hfc_pci: Fix a possible concurrency use-after-free bug in HFCPCI_l1hw()
	gdrom: fix a memory leak bug
	fsl/fman: Use GFP_ATOMIC in {memac,tgec}_add_hash_mac_address()
	block/swim3: Fix -EBUSY error when re-opening device after unmount
	thermal: bcm2835: enable hwmon explicitly
	kdb: Don't back trace on a cpu that didn't round up
	PCI: imx: Enable MSI from downstream components
	thermal: generic-adc: Fix adc to temp interpolation
	HID: lenovo: Add checks to fix of_led_classdev_register
	arm64/sve: ptrace: Fix SVE_PT_REGS_OFFSET definition
	kernel/hung_task.c: break RCU locks based on jiffies
	proc/sysctl: fix return error for proc_doulongvec_minmax()
	kernel/hung_task.c: force console verbose before panic
	fs/epoll: drop ovflist branch prediction
	exec: load_script: don't blindly truncate shebang string
	kernel/kcov.c: mark write_comp_data() as notrace
	scripts/gdb: fix lx-version string output
	xfs: Fix xqmstats offsets in /proc/fs/xfs/xqmstat
	xfs: cancel COW blocks before swapext
	xfs: Fix error code in 'xfs_ioc_getbmap()'
	xfs: fix overflow in xfs_attr3_leaf_verify
	xfs: fix shared extent data corruption due to missing cow reservation
	xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers
	xfs: delalloc -> unwritten COW fork allocation can go wrong
	fs/xfs: fix f_ffree value for statfs when project quota is set
	xfs: fix PAGE_MASK usage in xfs_free_file_space
	xfs: fix inverted return from xfs_btree_sblock_verify_crc
	thermal: hwmon: inline helpers when CONFIG_THERMAL_HWMON is not set
	dccp: fool proof ccid_hc_[rt]x_parse_options()
	enic: fix checksum validation for IPv6
	lib/test_rhashtable: Make test_insert_dup() allocate its hash table dynamically
	net: dp83640: expire old TX-skb
	net: dsa: Fix lockdep false positive splat
	net: dsa: Fix NULL checking in dsa_slave_set_eee()
	net: dsa: mv88e6xxx: Fix counting of ATU violations
	net: dsa: slave: Don't propagate flag changes on down slave interfaces
	net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames
	net: systemport: Fix WoL with password after deep sleep
	rds: fix refcount bug in rds_sock_addref
	Revert "net: phy: marvell: avoid pause mode on SGMII-to-Copper for 88e151x"
	rxrpc: bad unlock balance in rxrpc_recvmsg
	sctp: check and update stream->out_curr when allocating stream_out
	sctp: walk the list of asoc safely
	skge: potential memory corruption in skge_get_regs()
	virtio_net: Account for tx bytes and packets on sending xdp_frames
	net/mlx5e: FPGA, fix Innova IPsec TX offload data path performance
	xfs: eof trim writeback mapping as soon as it is cached
	ALSA: compress: Fix stop handling on compressed capture streams
	ALSA: usb-audio: Add support for new T+A USB DAC
	ALSA: hda - Serialize codec registrations
	ALSA: hda/realtek - Fix lose hp_pins for disable auto mute
	ALSA: hda/realtek - Use a common helper for hp pin reference
	ALSA: hda/realtek - Headset microphone support for System76 darp5
	fuse: call pipe_buf_release() under pipe lock
	fuse: decrement NR_WRITEBACK_TEMP on the right page
	fuse: handle zero sized retrieve correctly
	HID: debug: fix the ring buffer implementation
	dmaengine: bcm2835: Fix interrupt race on RT
	dmaengine: bcm2835: Fix abort of transactions
	dmaengine: imx-dma: fix wrong callback invoke
	futex: Handle early deadlock return correctly
	irqchip/gic-v3-its: Plug allocation race for devices sharing a DevID
	usb: phy: am335x: fix race condition in _probe
	usb: dwc3: gadget: Handle 0 xfer length for OUT EP
	usb: gadget: udc: net2272: Fix bitwise and boolean operations
	usb: gadget: musb: fix short isoc packets with inventra dma
	staging: speakup: fix tty-operation NULL derefs
	scsi: cxlflash: Prevent deadlock when adapter probe fails
	scsi: aic94xx: fix module loading
	KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
	kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
	KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
	cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
	perf/x86/intel/uncore: Add Node ID mask
	x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()
	perf/core: Don't WARN() for impossible ring-buffer sizes
	perf tests evsel-tp-sched: Fix bitwise operator
	serial: fix race between flush_to_ldisc and tty_open
	serial: 8250_pci: Make PCI class test non fatal
	serial: sh-sci: Do not free irqs that have already been freed
	cacheinfo: Keep the old value if of_property_read_u32 fails
	IB/hfi1: Add limit test for RC/UC send via loopback
	perf/x86/intel: Delay memory deallocation until x86_pmu_dead_cpu()
	ath9k: dynack: make ewma estimation faster
	ath9k: dynack: check da->enabled first in sampling routines
	Linux 4.19.21

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-02-12 20:37:21 +01:00
Waiman Long
f73c77535f mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
[ Upstream commit 3c0c12cc8f00ca5f81acb010023b8eb13e9a7004 ]

When CONFIG_KASAN is enabled on large memory SMP systems, the deferrred
pages initialization can take a long time.  Below were the reported init
times on a 8-socket 96-core 4TB IvyBridge system.

  1) Non-debug kernel without CONFIG_KASAN
     [    8.764222] node 1 initialised, 132086516 pages in 7027ms

  2) Debug kernel with CONFIG_KASAN
     [  146.288115] node 1 initialised, 132075466 pages in 143052ms

So the page init time in a debug kernel was 20X of the non-debug kernel.
The long init time can be problematic as the page initialization is done
with interrupt disabled.  In this particular case, it caused the
appearance of following warning messages as well as NMI backtraces of all
the cores that were doing the initialization.

[   68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   68.241000] rcu: 	25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
[   68.241000] rcu: 	44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
[   68.241000] rcu: 	54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
[   68.241000] rcu: 	60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
[   68.241000] rcu: 	72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
[   68.241000] rcu: 	84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
[   68.241000] rcu: 	111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
[   68.241000] rcu: 	(detected by 13, t=65018 jiffies, g=249, q=2)

The long init time was mainly caused by the call to kasan_free_pages() to
poison the newly initialized pages.  On a 4TB system, we are talking about
almost 500GB of memory probably on the same node.

In reality, we may not need to poison the newly initialized pages before
they are ever allocated.  So KASAN poisoning of freed pages before the
completion of deferred memory initialization is now disabled.  Those pages
will be properly poisoned when they are allocated or freed after deferred
pages initialization is done.

With this change, the new page initialization time became:

[   21.948010] node 1 initialised, 132075466 pages in 18702ms

This was still about double the non-debug kernel time, but was much
better than before.

Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-12 19:47:18 +01:00
Greg Kroah-Hartman
18ba00a34e Merge 4.19.19 into android-4.19
Changes in 4.19.19
	amd-xgbe: Fix mdio access for non-zero ports and clause 45 PHYs
	net: bridge: Fix ethernet header pointer before check skb forwardable
	net: Fix usage of pskb_trim_rcsum
	net: phy: marvell: Errata for mv88e6390 internal PHYs
	net: phy: mdio_bus: add missing device_del() in mdiobus_register() error handling
	net/sched: act_tunnel_key: fix memory leak in case of action replace
	net_sched: refetch skb protocol for each filter
	openvswitch: Avoid OOB read when parsing flow nlattrs
	vhost: log dirty page correctly
	mlxsw: pci: Increase PCI SW reset timeout
	net: ipv4: Fix memory leak in network namespace dismantle
	mlxsw: spectrum_fid: Update dummy FID index
	mlxsw: pci: Ring CQ's doorbell before RDQ's
	net/sched: cls_flower: allocate mask dynamically in fl_change()
	udp: with udp_segment release on error path
	ip6_gre: fix tunnel list corruption for x-netns
	erspan: build the header with the right proto according to erspan_ver
	net: phy: marvell: Fix deadlock from wrong locking
	ip6_gre: update version related info when changing link
	tcp: allow MSG_ZEROCOPY transmission also in CLOSE_WAIT state
	mei: me: mark LBG devices as having dma support
	mei: me: add denverton innovation engine device IDs
	USB: leds: fix regression in usbport led trigger
	USB: serial: simple: add Motorola Tetra TPG2200 device id
	USB: serial: pl2303: add new PID to support PL2303TB
	ceph: clear inode pointer when snap realm gets dropped by its inode
	ASoC: atom: fix a missing check of snd_pcm_lib_malloc_pages
	ASoC: rt5514-spi: Fix potential NULL pointer dereference
	ASoC: tlv320aic32x4: Kernel OOPS while entering DAPM standby mode
	clk: socfpga: stratix10: fix rate calculation for pll clocks
	clk: socfpga: stratix10: fix naming convention for the fixed-clocks
	inotify: Fix fd refcount leak in inotify_add_watch().
	ALSA: hda/realtek - Fix typo for ALC225 model
	ALSA: hda - Add mute LED support for HP ProBook 470 G5
	ARCv2: lib: memeset: fix doing prefetchw outside of buffer
	ARC: adjust memblock_reserve of kernel memory
	ARC: perf: map generic branches to correct hardware condition
	s390/mm: always force a load of the primary ASCE on context switch
	s390/early: improve machine detection
	s390/smp: fix CPU hotplug deadlock with CPU rescan
	misc: ibmvsm: Fix potential NULL pointer dereference
	char/mwave: fix potential Spectre v1 vulnerability
	mmc: dw_mmc-bluefield: : Fix the license information
	mmc: meson-gx: Free irq in release() callback
	staging: rtl8188eu: Add device code for D-Link DWA-121 rev B1
	tty: Handle problem if line discipline does not have receive_buf
	uart: Fix crash in uart_write and uart_put_char
	tty/n_hdlc: fix __might_sleep warning
	hv_balloon: avoid touching uninitialized struct page during tail onlining
	Drivers: hv: vmbus: Check for ring when getting debug info
	vgacon: unconfuse vc_origin when using soft scrollback
	CIFS: Fix possible hang during async MTU reads and writes
	CIFS: Fix credits calculations for reads with errors
	CIFS: Fix credit calculation for encrypted reads with errors
	CIFS: Do not reconnect TCP session in add_credits()
	smb3: add credits we receive from oplock/break PDUs
	Input: xpad - add support for SteelSeries Stratus Duo
	Input: input_event - provide override for sparc64
	Input: uinput - fix undefined behavior in uinput_validate_absinfo()
	acpi/nfit: Block function zero DSMs
	acpi/nfit: Fix command-supported detection
	scsi: ufs: Use explicit access size in ufshcd_dump_regs
	dm thin: fix passdown_double_checking_shared_status()
	dm crypt: fix parsing of extended IV arguments
	drm/amdgpu: Add APTX quirk for Lenovo laptop
	KVM: x86: Fix single-step debugging
	KVM: x86: Fix PV IPIs for 32-bit KVM host
	KVM: x86: WARN_ONCE if sending a PV IPI returns a fatal error
	kvm: x86/vmx: Use kzalloc for cached_vmcs12
	KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
	x86/pkeys: Properly copy pkey state at fork()
	x86/selftests/pkeys: Fork() to check for state being preserved
	x86/kaslr: Fix incorrect i8254 outb() parameters
	x86/entry/64/compat: Fix stack switching for XEN PV
	posix-cpu-timers: Unbreak timer rearming
	net: sun: cassini: Cleanup license conflict
	irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size
	can: dev: __can_get_echo_skb(): fix bogous check for non-existing skb by removing it
	can: bcm: check timer values before ktime conversion
	can: flexcan: fix NULL pointer exception during bringup
	vt: make vt_console_print() compatible with the unicode screen buffer
	vt: always call notifier with the console lock held
	vt: invoke notifier on screen size change
	drm/meson: Fix atomic mode switching regression
	bpf: improve verifier branch analysis
	bpf: add per-insn complexity limit
	bpf: move {prev_,}insn_idx into verifier env
	bpf: move tmp variable into ax register in interpreter
	bpf: enable access to ax register also from verifier rewrite
	bpf: restrict map value pointer arithmetic for unprivileged
	bpf: restrict stack pointer arithmetic for unprivileged
	bpf: restrict unknown scalars of mixed signed bounds for unprivileged
	bpf: fix check_map_access smin_value test when pointer contains offset
	bpf: prevent out of bounds speculation on pointer arithmetic
	bpf: fix sanitation of alu op with pointer / scalar type from different paths
	bpf: fix inner map masking to prevent oob under speculation
	s390/smp: Fix calling smp_call_ipl_cpu() from ipl CPU
	nvmet-rdma: Add unlikely for response allocated check
	nvmet-rdma: fix null dereference under heavy load
	Revert "mm, memory_hotplug: initialize struct pages for the full memory section"
	usb: dwc3: gadget: Clear req->needs_extra_trb flag on cleanup
	ide: fix a typo in the settings proc file name
	Input: input_event - fix the CONFIG_SPARC64 mixup
	Linux 4.19.19

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-01-31 08:29:40 +01:00
Michal Hocko
6bab957396 Revert "mm, memory_hotplug: initialize struct pages for the full memory section"
commit 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a upstream.

This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.

The underlying assumption that one sparse section belongs into a single
numa node doesn't hold really. Robert Shteynfeld has reported a boot
failure. The boot log was not captured but his memory layout is as
follows:

  Early memory node ranges
    node   1: [mem 0x0000000000001000-0x0000000000090fff]
    node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
    node   1: [mem 0x0000000100000000-0x0000001423ffffff]
    node   0: [mem 0x0000001424000000-0x0000002023ffffff]

This means that node0 starts in the middle of a memory section which is
also in node1.  memmap_init_zone tries to initialize padding of a
section even when it is outside of the given pfn range because there are
code paths (e.g.  memory hotplug) which assume that the full worth of
memory section is always initialized.

In this particular case, though, such a range is already intialized and
most likely already managed by the page allocator.  Scribbling over
those pages corrupts the internal state and likely blows up when any of
those pages gets used.

Reported-by: Robert Shteynfeld <robert.shteynfeld@gmail.com>
Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
Cc: stable@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31 08:14:41 +01:00
Greg Kroah-Hartman
a872d2d074 Merge 4.19.13 into android-4.19
Changes in 4.19.13
	iomap: Revert "fs/iomap.c: get/put the page in iomap_page_create/release()"
	Revert "vfs: Allow userns root to call mknod on owned filesystems."
	USB: hso: Fix OOB memory access in hso_probe/hso_get_config_data
	xhci: Don't prevent USB2 bus suspend in state check intended for USB3 only
	USB: xhci: fix 'broken_suspend' placement in struct xchi_hcd
	USB: serial: option: add GosunCn ZTE WeLink ME3630
	USB: serial: option: add HP lt4132
	USB: serial: option: add Simcom SIM7500/SIM7600 (MBIM mode)
	USB: serial: option: add Fibocom NL668 series
	USB: serial: option: add Telit LN940 series
	ubifs: Handle re-linking of inodes correctly while recovery
	scsi: t10-pi: Return correct ref tag when queue has no integrity profile
	scsi: sd: use mempool for discard special page
	mmc: core: Reset HPI enabled state during re-init and in case of errors
	mmc: core: Allow BKOPS and CACHE ctrl even if no HPI support
	mmc: core: Use a minimum 1600ms timeout when enabling CACHE ctrl
	mmc: omap_hsmmc: fix DMA API warning
	gpio: max7301: fix driver for use with CONFIG_VMAP_STACK
	gpiolib-acpi: Only defer request_irq for GpioInt ACPI event handlers
	posix-timers: Fix division by zero bug
	KVM: X86: Fix NULL deref in vcpu_scan_ioapic
	kvm: x86: Add AMD's EX_CFG to the list of ignored MSRs
	KVM: Fix UAF in nested posted interrupt processing
	Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels
	futex: Cure exit race
	x86/mtrr: Don't copy uninitialized gentry fields back to userspace
	x86/mm: Fix decoy address handling vs 32-bit builds
	x86/vdso: Pass --eh-frame-hdr to the linker
	x86/intel_rdt: Ensure a CPU remains online for the region's pseudo-locking sequence
	panic: avoid deadlocks in re-entrant console drivers
	mm: add mm_pxd_folded checks to pgtable_bytes accounting functions
	mm: make the __PAGETABLE_PxD_FOLDED defines non-empty
	mm: introduce mm_[p4d|pud|pmd]_folded
	xfrm_user: fix freeing of xfrm states on acquire
	rtlwifi: Fix leak of skb when processing C2H_BT_INFO
	iwlwifi: mvm: don't send GEO_TX_POWER_LIMIT to old firmwares
	Revert "mwifiex: restructure rx_reorder_tbl_lock usage"
	iwlwifi: add new cards for 9560, 9462, 9461 and killer series
	media: ov5640: Fix set format regression
	mm, memory_hotplug: initialize struct pages for the full memory section
	mm: thp: fix flags for pmd migration when split
	mm, page_alloc: fix has_unmovable_pages for HugePages
	mm: don't miss the last page because of round-off error
	Input: elantech - disable elan-i2c for P52 and P72
	proc/sysctl: don't return ENOMEM on lookup when a table is unregistering
	drm/ioctl: Fix Spectre v1 vulnerabilities
	Linux 4.19.13

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-12-29 13:46:09 +01:00
Oscar Salvador
e27666dd8f mm, page_alloc: fix has_unmovable_pages for HugePages
commit 17e2e7d7e1b83fa324b3f099bfe426659aa3c2a4 upstream.

While playing with gigantic hugepages and memory_hotplug, I triggered
the following #PF when "cat memoryX/removable":

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  #PF error: [normal kernel read fault]
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 1 PID: 1481 Comm: cat Tainted: G            E     4.20.0-rc6-mm1-1-default+ #18
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
  RIP: 0010:has_unmovable_pages+0x154/0x210
  Call Trace:
   is_mem_section_removable+0x7d/0x100
   removable_show+0x90/0xb0
   dev_attr_show+0x1c/0x50
   sysfs_kf_seq_show+0xca/0x1b0
   seq_read+0x133/0x380
   __vfs_read+0x26/0x180
   vfs_read+0x89/0x140
   ksys_read+0x42/0x90
   do_syscall_64+0x5b/0x180
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

The reason is we do not pass the Head to page_hstate(), and so, the call
to compound_order() in page_hstate() returns 0, so we end up checking
all hstates's size to match PAGE_SIZE.

Obviously, we do not find any hstate matching that size, and we return
NULL.  Then, we dereference that NULL pointer in
hugepage_migration_supported() and we got the #PF from above.

Fix that by getting the head page before calling page_hstate().

Also, since gigantic pages span several pageblocks, re-adjust the logic
for skipping pages.  While are it, we can also get rid of the
round_up().

[osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal]
  Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de
Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-29 13:37:59 +01:00
Mikhail Zaslonko
7592dbfaf3 mm, memory_hotplug: initialize struct pages for the full memory section
commit 2830bf6f05fb3e05bc4743274b806c821807a684 upstream.

If memory end is not aligned with the sparse memory section boundary,
the mapping of such a section is only partly initialized.  This may lead
to VM_BUG_ON due to uninitialized struct page access from
is_mem_section_removable() or test_pages_in_a_zone() function triggered
by memory_hotplug sysfs handlers:

Here are the the panic examples:
 CONFIG_DEBUG_VM=y
 CONFIG_DEBUG_VM_PGFLAGS=y

 kernel parameter mem=2050M
 --------------------------
 page:000003d082008000 is uninitialized and poisoned
 page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 Call Trace:
 ( test_pages_in_a_zone+0xde/0x160)
   show_valid_zones+0x5c/0x190
   dev_attr_show+0x34/0x70
   sysfs_kf_seq_show+0xc8/0x148
   seq_read+0x204/0x480
   __vfs_read+0x32/0x178
   vfs_read+0x82/0x138
   ksys_read+0x5a/0xb0
   system_call+0xdc/0x2d8
 Last Breaking-Event-Address:
   test_pages_in_a_zone+0xde/0x160
 Kernel panic - not syncing: Fatal exception: panic_on_oops

 kernel parameter mem=3075M
 --------------------------
 page:000003d08300c000 is uninitialized and poisoned
 page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 Call Trace:
 ( is_mem_section_removable+0xb4/0x190)
   show_mem_removable+0x9a/0xd8
   dev_attr_show+0x34/0x70
   sysfs_kf_seq_show+0xc8/0x148
   seq_read+0x204/0x480
   __vfs_read+0x32/0x178
   vfs_read+0x82/0x138
   ksys_read+0x5a/0xb0
   system_call+0xdc/0x2d8
 Last Breaking-Event-Address:
   is_mem_section_removable+0xb4/0x190
 Kernel panic - not syncing: Fatal exception: panic_on_oops

Fix the problem by initializing the last memory section of each zone in
memmap_init_zone() till the very end, even if it goes beyond the zone end.

Michal said:

: This has alwways been problem AFAIU.  It just went unnoticed because we
: have zeroed memmaps during allocation before f7f99100d8 ("mm: stop
: zeroing memory during allocation in vmemmap") and so the above test
: would simply skip these ranges as belonging to zone 0 or provided a
: garbage.
:
: So I guess we do care for post f7f99100d8 kernels mostly and
: therefore Fixes: f7f99100d8 ("mm: stop zeroing memory during
: allocation in vmemmap")

Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
Fixes: f7f99100d8 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-29 13:37:58 +01:00
Greg Kroah-Hartman
67319b77a0 Merge 4.19.10 into android-4.19
Changes in 4.19.10
	ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
	ipv6: Check available headroom in ip6_xmit() even without options
	neighbour: Avoid writing before skb->head in neigh_hh_output()
	ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output
	net: 8139cp: fix a BUG triggered by changing mtu with network traffic
	net/mlx4_core: Correctly set PFC param if global pause is turned off.
	net/mlx4_en: Change min MTU size to ETH_MIN_MTU
	net: phy: don't allow __set_phy_supported to add unsupported modes
	net: Prevent invalid access to skb->prev in __qdisc_drop_all
	net: use skb_list_del_init() to remove from RX sublists
	Revert "net/ibm/emac: wrong bit is used for STA control"
	rtnetlink: ndo_dflt_fdb_dump() only work for ARPHRD_ETHER devices
	sctp: kfree_rcu asoc
	tcp: Do not underestimate rwnd_limited
	tcp: fix NULL ref in tail loss probe
	tun: forbid iface creation with rtnl ops
	virtio-net: keep vnet header zeroed after processing XDP
	net: phy: sfp: correct store of detected link modes
	sctp: update frag_point when stream_interleave is set
	net: restore call to netdev_queue_numa_node_write when resetting XPS
	net: fix XPS static_key accounting
	ARM: OMAP2+: prm44xx: Fix section annotation on omap44xx_prm_enable_io_wakeup
	ASoC: rsnd: fixup clock start checker
	ASoC: qdsp6: q6afe: Fix wrong MI2S SD line mask
	ASoC: qdsp6: q6afe-dai: Fix the dai widgets
	staging: rtl8723bs: Fix the return value in case of error in 'rtw_wx_read32()'
	ARM: dts: am3517: Fix pinmuxing for CD on MMC1
	ARM: dts: LogicPD Torpedo: Fix mmc3_dat1 interrupt
	ARM: dts: logicpd-somlv: Fix interrupt on mmc3_dat1
	ARM: dts: am3517-som: Fix WL127x Wifi interrupt
	ARM: OMAP1: ams-delta: Fix possible use of uninitialized field
	tools: bpftool: prevent infinite loop in get_fdinfo()
	ASoC: sun8i-codec: fix crash on module removal
	arm64: dts: sdm845-mtp: Reserve reserved gpios
	sysv: return 'err' instead of 0 in __sysv_write_inode
	netfilter: nf_conncount: use spin_lock_bh instead of spin_lock
	netfilter: nf_conncount: fix list_del corruption in conn_free
	netfilter: nf_conncount: fix unexpected permanent node of list.
	netfilter: nf_tables: don't skip inactive chains during update
	selftests: add script to stress-test nft packet path vs. control plane
	perf tools: Fix crash on synthesizing the unit
	netfilter: xt_RATEEST: remove netns exit routine
	netfilter: nf_tables: fix use-after-free when deleting compat expressions
	s390/cio: Fix cleanup of pfn_array alloc failure
	s390/cio: Fix cleanup when unsupported IDA format is used
	hwmon (ina2xx) Fix NULL id pointer in probe()
	hwmon: (raspberrypi) Fix initial notify
	ASoC: rockchip: add missing slave_config setting for I2S
	ASoC: wm_adsp: Fix dma-unsafe read of scratch registers
	ASoC: Intel: Power down links before turning off display audio power
	ASoC: qcom: Set dai_link id to each dai_link
	s390/cpum_cf: Reject request for sampling in event initialization
	hwmon: (ina2xx) Fix current value calculation
	ASoC: omap-abe-twl6040: Fix missing audio card caused by deferred probing
	ASoC: dapm: Recalculate audio map forcely when card instantiated
	spi: omap2-mcspi: Add missing suspend and resume calls
	hwmon: (mlxreg-fan) Fix macros for tacho fault reading
	bpf: allocate local storage buffers using GFP_ATOMIC
	aio: fix failure to put the file pointer
	netfilter: xt_hashlimit: fix a possible memory leak in htable_create()
	hwmon: (w83795) temp4_type has writable permission
	perf tools: Restore proper cwd on return from mnt namespace
	PCI: imx6: Fix link training status detection in link up check
	ASoC: acpi: fix: continue searching when machine is ignored
	objtool: Fix double-free in .cold detection error path
	objtool: Fix segfault in .cold detection with -ffunction-sections
	phy: qcom-qusb2: Use HSTX_TRIM fused value as is
	phy: qcom-qusb2: Fix HSTX_TRIM tuning with fused value for SDM845
	ARM: dts: at91: sama5d2: use the divided clock for SMC
	Btrfs: send, fix infinite loop due to directory rename dependencies
	RDMA/mlx5: Fix fence type for IB_WR_LOCAL_INV WR
	RDMA/core: Add GIDs while changing MAC addr only for registered ndev
	RDMA/bnxt_re: Fix system hang when registration with L2 driver fails
	RDMA/bnxt_re: Avoid accessing the device structure after it is freed
	RDMA/rdmavt: Fix rvt_create_ah function signature
	tools: bpftool: fix potential NULL pointer dereference in do_load
	ASoC: omap-mcbsp: Fix latency value calculation for pm_qos
	ASoC: omap-mcpdm: Add pm_qos handling to avoid under/overruns with CPU_IDLE
	ASoC: omap-dmic: Add pm_qos handling to avoid overruns with CPU_IDLE
	exportfs: do not read dentry after free
	RDMA/hns: Bugfix pbl configuration for rereg mr
	bpf: fix check of allowed specifiers in bpf_trace_printk
	fsi: master-ast-cf: select GENERIC_ALLOCATOR
	ipvs: call ip_vs_dst_notifier earlier than ipv6_dev_notf
	USB: omap_udc: use devm_request_irq()
	USB: omap_udc: fix crashes on probe error and module removal
	USB: omap_udc: fix omap_udc_start() on 15xx machines
	USB: omap_udc: fix USB gadget functionality on Palm Tungsten E
	USB: omap_udc: fix rejection of out transfers when DMA is used
	thunderbolt: Prevent root port runtime suspend during NVM upgrade
	drm/meson: add support for 1080p25 mode
	netfilter: ipv6: Preserve link scope traffic original oif
	IB/mlx5: Fix page fault handling for MW
	netfilter: add missing error handling code for register functions
	netfilter: nat: fix double register in masquerade modules
	netfilter: nf_conncount: remove wrong condition check routine
	KVM: VMX: Update shared MSRs to be saved/restored on MSR_EFER.LMA changes
	KVM: x86: fix empty-body warnings
	x86/kvm/vmx: fix old-style function declaration
	net: thunderx: fix NULL pointer dereference in nic_remove
	usb: gadget: u_ether: fix unsafe list iteration
	netfilter: nf_tables: deactivate expressions in rule replecement routine
	ALSA: usb-audio: Add vendor and product name for Dell WD19 Dock
	cachefiles: Fix an assertion failure when trying to update a failed object
	fscache: Fix race in fscache_op_complete() due to split atomic_sub & read
	cachefiles: Fix page leak in cachefiles_read_backing_file while vmscan is active
	igb: fix uninitialized variables
	ixgbe: recognize 1000BaseLX SFP modules as 1Gbps
	net: hisilicon: remove unexpected free_netdev
	drm/amdgpu: Add delay after enable RLC ucode
	drm/ast: fixed reading monitor EDID not stable issue
	xen: xlate_mmu: add missing header to fix 'W=1' warning
	Revert "xen/balloon: Mark unallocated host memory as UNUSABLE"
	pvcalls-front: fixes incorrect error handling
	pstore/ram: Correctly calculate usable PRZ bytes
	afs: Fix validation/callback interaction
	fscache: fix race between enablement and dropping of object
	cachefiles: Explicitly cast enumerated type in put_object
	fscache, cachefiles: remove redundant variable 'cache'
	nvme: warn when finding multi-port subsystems without multipathing enabled
	nvme: flush namespace scanning work just before removing namespaces
	nvme-rdma: fix double freeing of async event data
	ACPI/IORT: Fix iort_get_platform_device_domain() uninitialized pointer value
	ocfs2: fix deadlock caused by ocfs2_defrag_extent()
	mm/page_alloc.c: fix calculation of pgdat->nr_zones
	hfs: do not free node before using
	hfsplus: do not free node before using
	debugobjects: avoid recursive calls with kmemleak
	proc: fixup map_files test on arm
	kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace
	initramfs: clean old path before creating a hardlink
	ocfs2: fix potential use after free
	flexfiles: enforce per-mirror stateid only for v4 DSes
	dax: Check page->mapping isn't NULL
	ALSA: fireface: fix reference to wrong register for clock configuration
	ALSA: hda/realtek - Fixed headphone issue for ALC700
	ALSA: hda/realtek: ALC294 mic and headset-mode fixups for ASUS X542UN
	ALSA: hda/realtek: Enable audio jacks of ASUS UX533FD with ALC294
	ALSA: hda/realtek: Enable audio jacks of ASUS UX433FN/UX333FA with ALC294
	ALSA: hda/realtek - Fix the mute LED regresion on Lenovo X1 Carbon
	IB/hfi1: Fix an out-of-bounds access in get_hw_stats
	bpf: fix off-by-one error in adjust_subprog_starts
	tcp: lack of available data can also cause TSO defer
	Linux 4.19.10

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-12-17 09:39:43 +01:00
Wei Yang
505bc9f389 mm/page_alloc.c: fix calculation of pgdat->nr_zones
[ Upstream commit 8f416836c0d50b198cad1225132e5abebf8980dc ]

init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
'zone_idx(zone) + 1' unconditionally.  This is correct in the normal
case, while not exact in hot-plug situation.

This function is used in two places:

  * free_area_init_core()
  * move_pfn_range_to_zone()

In the first case, we are sure zone index increase monotonically.  While
in the second one, this is under users control.

One way to reproduce this is:
----------------------------

1. create a virtual machine with empty node1

   -m 4G,slots=32,maxmem=32G \
   -smp 4,maxcpus=8          \
   -numa node,nodeid=0,mem=4G,cpus=0-3 \
   -numa node,nodeid=1,mem=0G,cpus=4-7

2. hot-add cpu 3-7

   cpu-add [3-7]

2. hot-add memory to nod1

   object_add memory-backend-ram,id=ram0,size=1G
   device_add pc-dimm,id=dimm0,memdev=ram0,node=1

3. online memory with following order

   echo online_movable > memory47/state
   echo online > memory40/state

After this, node1 will have its nr_zones equals to (ZONE_NORMAL + 1)
instead of (ZONE_MOVABLE + 1).

Michal said:
 "Having an incorrect nr_zones might result in all sorts of problems
  which would be quite hard to debug (e.g. reclaim not considering the
  movable zone). I do not expect many users would suffer from this it
  but still this is trivial and obviously right thing to do so
  backporting to the stable tree shouldn't be harmful (last famous
  words)"

Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-17 09:24:40 +01:00
Greg Kroah-Hartman
635c56d224 Merge 4.19.6 into android-4.19
Changes in 4.19.6
	HID: steam: remove input device when a hid client is running.
	efi/libstub: arm: support building with clang
	usb: core: Fix hub port connection events lost
	usb: dwc3: gadget: fix ISOC TRB type on unaligned transfers
	usb: dwc3: gadget: Properly check last unaligned/zero chain TRB
	usb: dwc3: core: Clean up ULPI device
	usb: dwc3: Fix NULL pointer exception in dwc3_pci_remove()
	xhci: Fix leaking USB3 shared_hcd at xhci removal
	xhci: handle port status events for removed USB3 hcd
	xhci: Add check for invalid byte size error when UAS devices are connected.
	usb: xhci: fix uninitialized completion when USB3 port got wrong status
	usb: xhci: fix timeout for transition from RExit to U0
	xhci: Add quirk to workaround the errata seen on Cavium Thunder-X2 Soc
	usb: xhci: Prevent bus suspend if a port connect change or polling state is detected
	ALSA: oss: Use kvzalloc() for local buffer allocations
	MAINTAINERS: Add Sasha as a stable branch maintainer
	Documentation/security-bugs: Clarify treatment of embargoed information
	Documentation/security-bugs: Postpone fix publication in exceptional cases
	mmc: sdhci-pci: Try "cd" for card-detect lookup before using NULL
	mmc: sdhci-pci: Workaround GLK firmware failing to restore the tuning value
	gpio: don't free unallocated ida on gpiochip_add_data_with_key() error path
	iwlwifi: fix wrong WGDS_WIFI_DATA_SIZE
	iwlwifi: mvm: support sta_statistics() even on older firmware
	iwlwifi: mvm: fix regulatory domain update when the firmware starts
	iwlwifi: mvm: don't use SAR Geo if basic SAR is not used
	brcmfmac: fix reporting support for 160 MHz channels
	opp: ti-opp-supply: Dynamically update u_volt_min
	opp: ti-opp-supply: Correct the supply in _get_optimal_vdd_voltage call
	tools/power/cpupower: fix compilation with STATIC=true
	v9fs_dir_readdir: fix double-free on p9stat_read error
	selinux: Add __GFP_NOWARN to allocation at str_read()
	Input: synaptics - avoid using uninitialized variable when probing
	bfs: add sanity check at bfs_fill_super()
	sctp: clear the transport of some out_chunk_list chunks in sctp_assoc_rm_peer
	gfs2: Don't leave s_fs_info pointing to freed memory in init_sbd
	llc: do not use sk_eat_skb()
	mm: don't warn about large allocations for slab
	mm/memory.c: recheck page table entry with page table lock held
	tcp: do not release socket ownership in tcp_close()
	drm/fb-helper: Blacklist writeback when adding connectors to fbdev
	drm/amdgpu: Add missing firmware entry for HAINAN
	drm/vc4: Set ->legacy_cursor_update to false when doing non-async updates
	drm/amdgpu: Fix oops when pp_funcs->switch_power_profile is unset
	drm/i915: Disable LP3 watermarks on all SNB machines
	drm/ast: change resolution may cause screen blurred
	drm/ast: fixed cursor may disappear sometimes
	drm/ast: Remove existing framebuffers before loading driver
	can: flexcan: Unlock the MB unconditionally
	can: dev: can_get_echo_skb(): factor out non sending code to __can_get_echo_skb()
	can: dev: __can_get_echo_skb(): replace struct can_frame by canfd_frame to access frame length
	can: dev: __can_get_echo_skb(): Don't crash the kernel if can_priv::echo_skb is accessed out of bounds
	can: dev: __can_get_echo_skb(): print error message, if trying to echo non existing skb
	can: rx-offload: introduce can_rx_offload_get_echo_skb() and can_rx_offload_queue_sorted() functions
	can: rx-offload: rename can_rx_offload_irq_queue_err_skb() to can_rx_offload_queue_tail()
	can: flexcan: use can_rx_offload_queue_sorted() for flexcan_irq_bus_*()
	can: flexcan: handle tx-complete CAN frames via rx-offload infrastructure
	can: raw: check for CAN FD capable netdev in raw_sendmsg()
	can: hi311x: Use level-triggered interrupt
	can: flexcan: Always use last mailbox for TX
	can: flexcan: remove not needed struct flexcan_priv::tx_mb and struct flexcan_priv::tx_mb_idx
	ACPICA: AML interpreter: add region addresses in global list during initialization
	IB/hfi1: Eliminate races in the SDMA send error path
	fsnotify: generalize handling of extra event flags
	fanotify: fix handling of events on child sub-directory
	pinctrl: meson: fix pinconf bias disable
	pinctrl: meson: fix gxbb ao pull register bits
	pinctrl: meson: fix gxl ao pull register bits
	pinctrl: meson: fix meson8 ao pull register bits
	pinctrl: meson: fix meson8b ao pull register bits
	tools/testing/nvdimm: Fix the array size for dimm devices.
	scsi: lpfc: fix remoteport access
	scsi: hisi_sas: Remove set but not used variable 'dq_list'
	KVM: PPC: Move and undef TRACE_INCLUDE_PATH/FILE
	cpufreq: imx6q: add return value check for voltage scale
	rtc: cmos: Do not export alarm rtc_ops when we do not support alarms
	rtc: pcf2127: fix a kmemleak caused in pcf2127_i2c_gather_write
	crypto: simd - correctly take reqsize of wrapped skcipher into account
	floppy: fix race condition in __floppy_read_block_0()
	powerpc/io: Fix the IO workarounds code to work with Radix
	sched/fair: Fix cpu_util_wake() for 'execl' type workloads
	perf/x86/intel/uncore: Add more IMC PCI IDs for KabyLake and CoffeeLake CPUs
	block: copy ioprio in __bio_clone_fast() and bounce
	SUNRPC: Fix a bogus get/put in generic_key_to_expire()
	riscv: add missing vdso_install target
	RISC-V: Silence some module warnings on 32-bit
	drm/amdgpu: fix bug with IH ring setup
	kdb: Use strscpy with destination buffer size
	NFSv4: Fix an Oops during delegation callbacks
	powerpc/numa: Suppress "VPHN is not supported" messages
	efi/arm: Revert deferred unmap of early memmap mapping
	z3fold: fix possible reclaim races
	mm, memory_hotplug: check zone_movable in has_unmovable_pages
	tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
	mm, page_alloc: check for max order in hot path
	dax: Avoid losing wakeup in dax_lock_mapping_entry
	include/linux/pfn_t.h: force '~' to be parsed as an unary operator
	tty: wipe buffer.
	tty: wipe buffer if not echoing data
	gfs2: Fix iomap buffer head reference counting bug
	rcu: Make need_resched() respond to urgent RCU-QS needs
	media: ov5640: Re-work MIPI startup sequence
	media: ov5640: Fix timings setup code
	media: ov5640: fix exposure regression
	media: ov5640: fix auto gain & exposure when changing mode
	media: ov5640: fix wrong binning value in exposure calculation
	media: ov5640: fix auto controls values when switching to manual mode
	Linux 4.19.6

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-12-06 09:32:46 +01:00
Rik van Riel
fe0ea3084c ANDROID: add extra free kbytes tunable
Add a userspace visible knob to tell the VM to keep an extra amount
of memory free, by increasing the gap between each zone's min and
low watermarks.

This is useful for realtime applications that call system
calls and have a bound on the number of allocations that happen
in any short time period.  In this application, extra_free_kbytes
would be left at an amount equal to or larger than than the
maximum number of allocations that happen in any burst.

It may also be useful to reduce the memory use of virtual
machines (temporarily?), in a way that does not cause memory
fragmentation like ballooning does.

[ccross]
Revived for use on old kernels where no other solution exists.
The tunable will be removed on kernels that do better at avoiding
direct reclaim.

[surenb]
Will be reverted as soon as Android framework is reworked to
use upstream-supported watermark_scale_factor instead of
extra_free_kbytes.

Bug: 86445363
Bug: 109664768
Bug: 120445732
Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2018-12-05 09:48:12 -08:00
Michal Hocko
9dec38554a mm, page_alloc: check for max order in hot path
[ Upstream commit c63ae43ba53bc432b414fd73dd5f4b01fcb1ab43 ]

Konstantin has noticed that kvmalloc might trigger the following
warning:

  WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
  [...]
  Call Trace:
   fragmentation_index+0x76/0x90
   compaction_suitable+0x4f/0xf0
   shrink_node+0x295/0x310
   node_reclaim+0x205/0x250
   get_page_from_freelist+0x649/0xad0
   __alloc_pages_nodemask+0x12a/0x2a0
   kmalloc_large_node+0x47/0x90
   __kmalloc_node+0x22b/0x2e0
   kvmalloc_node+0x3e/0x70
   xt_alloc_table_info+0x3a/0x80 [x_tables]
   do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
   nf_setsockopt+0x44/0x60
   SyS_setsockopt+0x6f/0xc0
   do_syscall_64+0x67/0x120
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2

the problem is that we only check for an out of bound order in the slow
path and the node reclaim might happen from the fast path already.  This
is fixable by making sure that kvmalloc doesn't ever use kmalloc for
requests that are larger than KMALLOC_MAX_SIZE but this also shows that
the code is rather fragile.  A recent UBSAN report just underlines that
by the following report

  UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
  shift exponent 51 is too large for 32-bit type 'int'
  CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0xd2/0x148 lib/dump_stack.c:113
   ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
   __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
   __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
   zone_watermark_fast mm/page_alloc.c:3216 [inline]
   get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
   __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
   alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
   alloc_pages include/linux/gfp.h:509 [inline]
   __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
   dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
   raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
   raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
   fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
   fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
   __blkdev_driver_ioctl block/ioctl.c:303 [inline]
   blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
   block_ioctl+0x105/0x150 fs/block_dev.c:1883
   vfs_ioctl fs/ioctl.c:46 [inline]
   do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
   ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
   __do_sys_ioctl fs/ioctl.c:709 [inline]
   __se_sys_ioctl fs/ioctl.c:707 [inline]
   __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
   do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Note that this is not a kvmalloc path.  It is just that the fast path
really depends on having sanitzed order as well.  Therefore move the
order check to the fast path.

Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Kyungtae Kim <kt0755@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Byoungyoung Lee <lifeasageek@gmail.com>
Cc: "Dae R. Jeong" <threeearcat@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-01 09:37:34 +01:00
Michal Hocko
b44fd1268b mm, memory_hotplug: check zone_movable in has_unmovable_pages
[ Upstream commit 9d7899999c62c1a81129b76d2a6ecbc4655e1597 ]

Page state checks are racy.  Under a heavy memory workload (e.g.  stress
-m 200 -t 2h) it is quite easy to hit a race window when the page is
allocated but its state is not fully populated yet.  A debugging patch to
dump the struct page state shows

  has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
  page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
  flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)

Note that the state has been checked for both PageLRU and PageSwapBacked
already.  Closing this race completely would require some sort of retry
logic.  This can be tricky and error prone (think of potential endless
or long taking loops).

Workaround this problem for movable zones at least.  Such a zone should
only contain movable pages.  Commit 15c30bc090 ("mm, memory_hotplug:
make has_unmovable_pages more robust") has told us that this is not
strictly true though.  Bootmem pages should be marked reserved though so
we can move the original check after the PageReserved check.  Pages from
other zones are still prone to races but we even do not pretend that
memory hotremove works for those so pre-mature failure doesn't hurt that
much.

Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
Fixes: 15c30bc090 ("mm, memory_hotplug: make has_unmovable_pages more robust")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-01 09:37:33 +01:00
Srikar Dronamraju
e054637597 mm, sched/numa: Remove remaining traces of NUMA rate-limiting
Remove the leftover pglist_data::numabalancing_migrate_lock and its
initialization, we stopped using this lock with:

  efaffc5e40 ("mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration")

[ mingo: Rewrote the changelog. ]

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1538824999-31230-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-09 08:30:51 +02:00
Mel Gorman
efaffc5e40 mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration
Rate limiting of page migrations due to automatic NUMA balancing was
introduced to mitigate the worst-case scenario of migrating at high
frequency due to false sharing or slowly ping-ponging between nodes.
Since then, a lot of effort was spent on correctly identifying these
pages and avoiding unnecessary migrations and the safety net may no longer
be required.

Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
avoids spreading STREAM tasks wide prematurely. However, once the task
was properly placed, it delayed migrating the memory due to rate limiting.
Increasing the limit fixed the problem for him.

Currently, the limit is hard-coded and does not account for the real
capabilities of the hardware. Even if an estimate was attempted, it would
not properly account for the number of memory controllers and it could
not account for the amount of bandwidth used for normal accesses. Rather
than fudging, this patch simply eliminates the rate limiting.

However, Jirka reports that a STREAM configuration using multiple
processes achieved similar performance to 4.16. In local tests, this patch
improved performance of STREAM relative to the baseline but it is somewhat
machine-dependent. Most workloads show little or not performance difference
implying that there is not a heavily reliance on the throttling mechanism
and it is safe to remove.

STREAM on 2-socket machine
                         4.19.0-rc5             4.19.0-rc5
                         numab-v1r1       noratelimit-v1r1
MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02 11:31:14 +02:00
Aneesh Kumar K.V
464c7ffbcb mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.
When scanning for movable pages, filter out Hugetlb pages if hugepage
migration is not supported.  Without this we hit infinte loop in
__offline_pages() where we do

	pfn = scan_movable_pages(start_pfn, end_pfn);
	if (pfn) { /* We have movable pages */
		ret = do_migrate_range(pfn, end_pfn);
		goto repeat;
	}

Fix this by checking hugepage_migration_supported both in
has_unmovable_pages which is the primary backoff mechanism for page
offlining and for consistency reasons also into scan_movable_pages
because it doesn't make any sense to return a pfn to non-migrateable
huge page.

This issue was revealed by, but not caused by 72b39cfc4d ("mm,
memory_hotplug: do not fail offlining too early").

Link: http://lkml.kernel.org/r/20180824063314.21981-1-aneesh.kumar@linux.ibm.com
Fixes: 72b39cfc4d ("mm, memory_hotplug: do not fail offlining too early")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Haren Myneni <haren@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04 16:45:02 -07:00
Mukesh Ojha
13ba17bee1 notifier: Remove notifier header file wherever not used
The conversion of the hotplug notifiers to a state machine left the
notifier.h includes around in some places. Remove them.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1535114033-4605-1-git-send-email-mojha@codeaurora.org
2018-08-30 12:56:40 +02:00
Naoya Horiguchi
d4ae9916ea mm: soft-offline: close the race against page allocation
A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
allocate a page that was just freed on the way of soft-offline.  This is
undesirable because soft-offline (which is about corrected error) is
less aggressive than hard-offline (which is about uncorrected error),
and we can make soft-offline fail and keep using the page for good
reason like "system is busy."

Two main changes of this patch are:

- setting migrate type of the target page to MIGRATE_ISOLATE. As done
  in free_unref_page_commit(), this makes kernel bypass pcplist when
  freeing the page. So we can assume that the page is in freelist just
  after put_page() returns,

- setting PG_hwpoison on free page under zone->lock which protects
  freelists, so this allows us to avoid setting PG_hwpoison on a page
  that is decided to be allocated soon.

[akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment]
Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <zy.zhengyi@alibaba-inc.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Oscar Salvador
03e85f9d5f mm/page_alloc: Introduce free_area_init_core_hotplug
Currently, whenever a new node is created/re-used from the memhotplug
path, we call free_area_init_node()->free_area_init_core().  But there is
some code that we do not really need to run when we are coming from such
path.

free_area_init_core() performs the following actions:

1) Initializes pgdat internals, such as spinlock, waitqueues and more.
2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
   when creating hash tables.
3) Account number of managed_pages per zone, substracting dma_reserved and
   memmap pages.
4) Initializes some fields of the zone structure data
5) Calls init_currently_empty_zone to initialize all the freelists
6) Calls memmap_init to initialize all pages belonging to certain zone

When called from memhotplug path, free_area_init_core() only performs
actions #1 and #4.

Action #2 is pointless as the zones do not have any pages since either the
node was freed, or we are re-using it, eitherway all zones belonging to
this node should have 0 pages.  For the same reason, action #3 results
always in manages_pages being 0.

Action #5 and #6 are performed later on when onlining the pages:
 online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
 online_pages()->move_pfn_range_to_zone()->memmap_init_zone()

This patch does two things:

First, moves the node/zone initializtion to their own function, so it
allows us to create a small version of free_area_init_core, where we only
perform:

1) Initialization of pgdat internals, such as spinlock, waitqueues and more
4) Initialization of some fields of the zone structure data

These two functions are: pgdat_init_internals() and zone_init_internals().

The second thing this patch does, is to introduce
free_area_init_core_hotplug(), the memhotplug version of
free_area_init_core():

Currently, we call free_area_init_node() from the memhotplug path.  In
there, we set some pgdat's fields, and call calculate_node_totalpages().
calculate_node_totalpages() calculates the # of pages the node has.

Since the node is either new, or we are re-using it, the zones belonging
to this node should not have any pages, so there is no point to calculate
this now.

Actually, we re-set these values to 0 later on with the calls to:

reset_node_managed_pages()
reset_node_present_pages()

The # of pages per node and the # of pages per zone will be calculated when
onlining the pages:

online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()

Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
__paginginit with __init, so their code gets freed up.

[osalvador@techadventures.net: fix section usage]
  Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
[osalvador@suse.de: v6]
  Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Oscar Salvador
0188dc98ad mm/page_alloc: inline function to handle CONFIG_DEFERRED_STRUCT_PAGE_INIT
Let us move the code between CONFIG_DEFERRED_STRUCT_PAGE_INIT to an inline
function.  Not having an ifdef in the function makes the code more
readable.

Link: http://lkml.kernel.org/r/20180730101757.28058-4-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Pavel Tatashin
7cc2a9596d mm: remove __paginginit
__paginginit is the same thing as __meminit except for platforms without
sparsemem, there it is defined as __init.

Remove __paginginit and use __meminit.  Use __ref in one single function
that merges __meminit and __init sections: setup_usemap().

Link: http://lkml.kernel.org/r/20180801122348.21588-4-osalvador@techadventures.net
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Pavel Tatashin
c1093b746c mm: access zone->node via zone_to_nid() and zone_set_nid()
zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to
have inline functions to access this field in order to avoid ifdef's in c
files.

Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Oscar Salvador
ace1db3976 mm/page_alloc.c: move ifdefery out of free_area_init_core
Patch series "Refactor free_area_init_core and add
free_area_init_core_hotplug", v6.

This patchset does three things:

 1) Clean up/refactor free_area_init_core/free_area_init_node
    by moving the ifdefery out of the functions.
 2) Move the pgdat/zone initialization in free_area_init_core to its
    own function.
 3) Introduce free_area_init_core_hotplug, a small subset of
    free_area_init_core, which is only called from memhotlug code path. In this
    way, we have:

    free_area_init_core: called during early initialization
    free_area_init_core_hotplug: called whenever a new node is allocated/re-used (memhotplug path)

This patch (of 5):

Moving the #ifdefs out of the function makes it easier to follow.

Link: http://lkml.kernel.org/r/20180730101757.28058-2-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Aaron Lu
d8a759b570 mm, page_alloc: double zone's batchsize
To improve page allocator's performance for order-0 pages, each CPU has
a Per-CPU-Pageset(PCP) per zone.  Whenever an order-0 page is needed,
PCP will be checked first before asking pages from Buddy.  When PCP is
used up, a batch of pages will be fetched from Buddy to improve
performance and the size of batch can affect performance.

zone's batch size gets doubled last time by commit ba56e91c9401("mm:
page_alloc: increase size of per-cpu-pages") over ten years ago.  Since
then, CPU has envolved a lot and CPU's cache sizes also increased.

Dave Hansen is concerned the current batch size doesn't fit well with
modern hardware and suggested me to do two things: first, use a page
allocator intensive benchmark, e.g.  will-it-scale/page_fault1 to find
out how performance changes with different batch sizes on various
machines and then choose a new default batch size; second, see how this
new batch size work with other workloads.

In the first test, we saw performance gains on high-core-count systems
and little to no effect on older systems with more modest core counts.
In this phase's test data, two candidates: 63 and 127 are chosen.

In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
and more will-it-scale sub-tests are tested to see how these two
candidates work with these workloads and decides a new default according
to their results.

Most test results are flat.  will-it-scale/page_fault2 process mode has
10%-18% performance increase on 4-sockets Skylake and Broadwell.
vm-scalability/lru-file-mmap-read has 17%-47% performance increase for
4-sockets servers while for 2-sockets servers, it caused 3%-8% performance
drop.  Further analysis showed that, with a larger pcp->batch and thus
larger pcp->high(the relationship of pcp->high=6 * pcp->batch is
maintained in this patch), zone lock contention shifted to LRU add side
lock contention and that caused performance drop.  This performance drop
might be mitigated by others' work on optimizing LRU lock.

Another downside of increasing pcp->batch is, when PCP is used up and need
to fetch a batch of pages from Buddy, since batch is increased, that time
can be longer than before.  My understanding is, this doesn't affect
slowpath where direct reclaim and compaction dominates.  For fastpath,
throughput is a win(according to will-it-scale/page_fault1) but worst
latency can be larger now.

Overall, I think double the batch size from 31 to 63 is relatively safe
and provide good performance boost for high-core-count systems.

The two phase's test results are listed below(all tests are done with THP
disabled).

Phase one(will-it-scale/page_fault1) test results:

Skylake-EX: increased batch size has a good effect on zone->lock
contention, though LRU contention will rise at the same time and
limited the final performance increase.

batch   score     change   zone_contention   lru_contention   total_contention
 31   15345900    +0.00%       64%                 8%           72%
 53   17903847   +16.67%       32%                38%           70%
 63   17992886   +17.25%       24%                45%           69%
 73   18022825   +17.44%       10%                61%           71%
119   18023401   +17.45%        4%                66%           70%
127   18029012   +17.48%        3%                66%           69%
137   18036075   +17.53%        4%                66%           70%
165   18035964   +17.53%        2%                67%           69%
188   18101105   +17.95%        2%                67%           69%
223   18130951   +18.15%        2%                67%           69%
255   18118898   +18.07%        2%                67%           69%
267   18101559   +17.96%        2%                67%           69%
299   18160468   +18.34%        2%                68%           70%
320   18139845   +18.21%        2%                67%           69%
393   18160869   +18.34%        2%                68%           70%
424   18170999   +18.41%        2%                68%           70%
458   18144868   +18.24%        2%                68%           70%
467   18142366   +18.22%        2%                68%           70%
498   18154549   +18.30%        1%                68%           69%
511   18134525   +18.17%        1%                69%           70%

Broadwell-EX: similar pattern as Skylake-EX.

batch   score     change   zone_contention   lru_contention   total_contention
 31   16703983    +0.00%       67%                 7%           74%
 53   18195393    +8.93%       43%                28%           71%
 63   18288885    +9.49%       38%                33%           71%
 73   18344329    +9.82%       35%                37%           72%
119   18535529   +10.96%       24%                46%           70%
127   18513596   +10.83%       23%                48%           71%
137   18514327   +10.84%       23%                48%           71%
165   18511840   +10.82%       22%                49%           71%
188   18593478   +11.31%       17%                53%           70%
223   18601667   +11.36%       17%                52%           69%
255   18774825   +12.40%       12%                58%           70%
267   18754781   +12.28%        9%                60%           69%
299   18892265   +13.10%        7%                63%           70%
320   18873812   +12.99%        8%                62%           70%
393   18891174   +13.09%        6%                64%           70%
424   18975108   +13.60%        6%                64%           70%
458   18932364   +13.34%        8%                62%           70%
467   18960891   +13.51%        5%                65%           70%
498   18944526   +13.41%        5%                64%           69%
511   18960839   +13.51%        5%                64%           69%

Skylake-EP: although increased batch reduced zone->lock contention, but
the effect is not as good as EX: zone->lock contention is still as high as
20% with a very high batch value instead of 1% on Skylake-EX or 5% on
Broadwell-EX.  Also, total_contention actually decreased with a higher
batch but that doesn't translate to performance increase.

batch   score    change   zone_contention   lru_contention   total_contention
 31   9554867    +0.00%       66%                 3%           69%
 53   9855486    +3.15%       63%                 3%           66%
 63   9980145    +4.45%       62%                 4%           66%
 73   10092774   +5.63%       62%                 5%           67%
119   10310061   +7.90%       45%                19%           64%
127   10342019   +8.24%       42%                19%           61%
137   10358182   +8.41%       42%                21%           63%
165   10397060   +8.81%       37%                24%           61%
188   10341808   +8.24%       34%                26%           60%
223   10349135   +8.31%       31%                27%           58%
255   10327189   +8.08%       28%                29%           57%
267   10344204   +8.26%       27%                29%           56%
299   10325043   +8.06%       25%                30%           55%
320   10310325   +7.91%       25%                31%           56%
393   10293274   +7.73%       21%                31%           52%
424   10311099   +7.91%       21%                32%           53%
458   10321375   +8.02%       21%                32%           53%
467   10303881   +7.84%       21%                32%           53%
498   10332462   +8.14%       20%                33%           53%
511   10325016   +8.06%       20%                32%           52%

Broadwell-EP: zone->lock and lru lock had an agreement to make sure
performance doesn't increase and they successfully managed to keep total
contention at 70%.

batch   score    change   zone_contention   lru_contention   total_contention
 31   10121178   +0.00%       19%                50%           69%
 53   10142366   +0.21%        6%                63%           69%
 63   10117984   -0.03%       11%                58%           69%
 73   10123330   +0.02%        7%                63%           70%
119   10108791   -0.12%        2%                67%           69%
127   10166074   +0.44%        3%                66%           69%
137   10141574   +0.20%        3%                66%           69%
165   10154499   +0.33%        2%                68%           70%
188   10124921   +0.04%        2%                67%           69%
223   10137399   +0.16%        2%                67%           69%
255   10143289   +0.22%        0%                68%           68%
267   10123535   +0.02%        1%                68%           69%
299   10140952   +0.20%        0%                68%           68%
320   10163170   +0.41%        0%                68%           68%
393   10000633   -1.19%        0%                69%           69%
424   10087998   -0.33%        0%                69%           69%
458   10187116   +0.65%        0%                69%           69%
467   10146790   +0.25%        0%                69%           69%
498   10197958   +0.76%        0%                69%           69%
511   10152326   +0.31%        0%                69%           69%

Haswell-EP: similar to Broadwell-EP.

batch   score   change   zone_contention   lru_contention   total_contention
 31   10442205   +0.00%       14%                48%           62%
 53   10442255   +0.00%        5%                57%           62%
 63   10452059   +0.09%        6%                57%           63%
 73   10482349   +0.38%        5%                59%           64%
119   10454644   +0.12%        3%                60%           63%
127   10431514   -0.10%        3%                59%           62%
137   10423785   -0.18%        3%                60%           63%
165   10481216   +0.37%        2%                61%           63%
188   10448755   +0.06%        2%                61%           63%
223   10467144   +0.24%        2%                61%           63%
255   10480215   +0.36%        2%                61%           63%
267   10484279   +0.40%        2%                61%           63%
299   10466450   +0.23%        2%                61%           63%
320   10452578   +0.10%        2%                61%           63%
393   10499678   +0.55%        1%                62%           63%
424   10481454   +0.38%        1%                62%           63%
458   10473562   +0.30%        1%                62%           63%
467   10484269   +0.40%        0%                62%           62%
498   10505599   +0.61%        0%                62%           62%
511   10483395   +0.39%        0%                62%           62%

Westmere-EP: contention is pretty small so not interesting.  Note too high
a batch value could hurt performance.

batch   score   change   zone_contention   lru_contention   total_contention
 31   4831523   +0.00%        2%                 3%            5%
 53   4834086   +0.05%        2%                 4%            6%
 63   4834262   +0.06%        2%                 3%            5%
 73   4832851   +0.03%        2%                 4%            6%
119   4830534   -0.02%        1%                 3%            4%
127   4827461   -0.08%        1%                 4%            5%
137   4827459   -0.08%        1%                 3%            4%
165   4820534   -0.23%        0%                 4%            4%
188   4817947   -0.28%        0%                 3%            3%
223   4809671   -0.45%        0%                 3%            3%
255   4802463   -0.60%        0%                 4%            4%
267   4801634   -0.62%        0%                 3%            3%
299   4798047   -0.69%        0%                 3%            3%
320   4793084   -0.80%        0%                 3%            3%
393   4785877   -0.94%        0%                 3%            3%
424   4782911   -1.01%        0%                 3%            3%
458   4779346   -1.08%        0%                 3%            3%
467   4780306   -1.06%        0%                 3%            3%
498   4780589   -1.05%        0%                 3%            3%
511   4773724   -1.20%        0%                 3%            3%

Skylake-Desktop: similar to Westmere-EP, nothing interesting.

batch   score   change   zone_contention   lru_contention   total_contention
 31   3906608   +0.00%        2%                 3%            5%
 53   3940164   +0.86%        2%                 3%            5%
 63   3937289   +0.79%        2%                 3%            5%
 73   3940201   +0.86%        2%                 3%            5%
119   3933240   +0.68%        2%                 3%            5%
127   3930514   +0.61%        2%                 4%            6%
137   3938639   +0.82%        0%                 3%            3%
165   3908755   +0.05%        0%                 3%            3%
188   3905621   -0.03%        0%                 3%            3%
223   3903015   -0.09%        0%                 4%            4%
255   3889480   -0.44%        0%                 3%            3%
267   3891669   -0.38%        0%                 4%            4%
299   3898728   -0.20%        0%                 4%            4%
320   3894547   -0.31%        0%                 4%            4%
393   3875137   -0.81%        0%                 4%            4%
424   3874521   -0.82%        0%                 3%            3%
458   3880432   -0.67%        0%                 4%            4%
467   3888715   -0.46%        0%                 3%            3%
498   3888633   -0.46%        0%                 4%            4%
511   3875305   -0.80%        0%                 5%            5%

Haswell-Desktop: zone->lock is pretty low as other desktops, though lru
contention is higher than other desktops.

batch   score   change   zone_contention   lru_contention   total_contention
 31   3511158   +0.00%        2%                 5%            7%
 53   3555445   +1.26%        2%                 6%            8%
 63   3561082   +1.42%        2%                 6%            8%
 73   3547218   +1.03%        2%                 6%            8%
119   3571319   +1.71%        1%                 7%            8%
127   3549375   +1.09%        0%                 6%            6%
137   3560233   +1.40%        0%                 6%            6%
165   3555176   +1.25%        2%                 6%            8%
188   3551501   +1.15%        0%                 8%            8%
223   3531462   +0.58%        0%                 7%            7%
255   3570400   +1.69%        0%                 7%            7%
267   3532235   +0.60%        1%                 8%            9%
299   3562326   +1.46%        0%                 6%            6%
320   3553569   +1.21%        0%                 8%            8%
393   3539519   +0.81%        0%                 7%            7%
424   3549271   +1.09%        0%                 8%            8%
458   3528885   +0.50%        0%                 8%            8%
467   3526554   +0.44%        0%                 7%            7%
498   3525302   +0.40%        0%                 9%            9%
511   3527556   +0.47%        0%                 8%            8%

Sandybridge-Desktop: the 0% contention isn't accurate but caused by
dropped fractional part. Since multiple contention path's contentions
are all under 1% here, with some arithmetic operations like add, the
final deviation could be as large as 3%.

batch   score   change   zone_contention   lru_contention   total_contention
 31   1744495   +0.00%        0%                 0%            0%
 53   1755341   +0.62%        0%                 0%            0%
 63   1758469   +0.80%        0%                 0%            0%
 73   1759626   +0.87%        0%                 0%            0%
119   1770417   +1.49%        0%                 0%            0%
127   1768252   +1.36%        0%                 0%            0%
137   1767848   +1.34%        0%                 0%            0%
165   1765088   +1.18%        0%                 0%            0%
188   1766918   +1.29%        0%                 0%            0%
223   1767866   +1.34%        0%                 0%            0%
255   1768074   +1.35%        0%                 0%            0%
267   1763187   +1.07%        0%                 0%            0%
299   1765620   +1.21%        0%                 0%            0%
320   1767603   +1.32%        0%                 0%            0%
393   1764612   +1.15%        0%                 0%            0%
424   1758476   +0.80%        0%                 0%            0%
458   1758593   +0.81%        0%                 0%            0%
467   1757915   +0.77%        0%                 0%            0%
498   1753363   +0.51%        0%                 0%            0%
511   1755548   +0.63%        0%                 0%            0%

Phase two test results:
Note: all percent change is against base(batch=31).

ebizzy.throughput (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    2410037±7%     2600451±2% +7.9%     2602878 +8.0%
lkp-bdw-ex1     1493328        1489243    -0.3%     1492145 -0.1%
lkp-skl-2sp2    1329674        1345891    +1.2%     1351056 +1.6%
lkp-bdw-ep2      711511         711511     0.0%      710708 -0.1%
lkp-wsm-ep2       75750          75528    -0.3%       75441 -0.4%
lkp-skl-d01      264126         262791    -0.5%      264113 +0.0%
lkp-hsw-d01      176601         176328    -0.2%      176368 -0.1%
lkp-sb02          98937          98937    +0.0%       99030 +0.1%

kbuild.buildtime (less is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     107.00        107.67  +0.6%        107.11  +0.1%
lkp-bdw-ex1       97.33         97.33  +0.0%         97.42  +0.1%
lkp-skl-2sp2     180.00        179.83  -0.1%        179.83  -0.1%
lkp-bdw-ep2      178.17        179.17  +0.6%        177.50  -0.4%
lkp-wsm-ep2      737.00        738.00  +0.1%        738.00  +0.1%
lkp-skl-d01      642.00        653.00  +1.7%        653.00  +1.7%
lkp-hsw-d01     1310.00       1316.00  +0.5%       1311.00  +0.1%

netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     948790        947144  -0.2%        948333 -0.0%
lkp-bdw-ex1      904224        904366  +0.0%        904926 +0.1%
lkp-skl-2sp2     239731        239607  -0.1%        239565 -0.1%
lk-bdw-ep2       365764        365933  +0.0%        365951 +0.1%
lkp-wsm-ep2       93736         93803  +0.1%         93808 +0.1%
lkp-skl-d01       77314         77303  -0.0%         77375 +0.1%
lkp-hsw-d01       58617         60387  +3.0%         60208 +2.7%
lkp-sb02          29990         30137  +0.5%         30103 +0.4%

oltp.transactions (higer is better)

machine         batch=31      batch=63             batch=127
lkp-bdw-ex1      9073276       9100377     +0.3%    9036344     -0.4%
lkp-skl-2sp2     8898717       8852054     -0.5%    8894459     -0.0%
lkp-bdw-ep2     13426155      13384654     -0.3%   13333637     -0.7%
lkp-hsw-ep2     13146314      13232784     +0.7%   13193163     +0.4%
lkp-wsm-ep2      5035355       5019348     -0.3%    5033418     -0.0%
lkp-skl-d01       418485       4413339     -0.1%    4419039     +0.0%
lkp-hsw-d01      3517817±5%    3396120±3%  -3.5%    3455138±3%  -1.8%

pigz.throughput (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.513e+08     1.507e+08 -0.4%      1.511e+08 -0.2%
lkp-bdw-ex1     2.060e+08     2.052e+08 -0.4%      2.044e+08 -0.8%
lkp-skl-2sp2    8.836e+08     8.845e+08 +0.1%      8.836e+08 -0.0%
lkp-bdw-ep2     8.275e+08     8.464e+08 +2.3%      8.330e+08 +0.7%
lkp-wsm-ep2     2.224e+08     2.221e+08 -0.2%      2.218e+08 -0.3%
lkp-skl-d01     1.177e+08     1.177e+08 -0.0%      1.176e+08 -0.1%
lkp-hsw-d01     1.154e+08     1.154e+08 +0.1%      1.154e+08 -0.0%
lkp-sb02        0.633e+08     0.633e+08 +0.1%      0.633e+08 +0.0%

will-it-scale.malloc1.processes (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1      620181       620484 +0.0%         620240 +0.0%
lkp-bdw-ex1      1403610      1401201 -0.2%        1417900 +1.0%
lkp-skl-2sp2     1288097      1284145 -0.3%        1283907 -0.3%
lkp-bdw-ep2      1427879      1427675 -0.0%        1428266 +0.0%
lkp-hsw-ep2      1362546      1353965 -0.6%        1354759 -0.6%
lkp-wsm-ep2      2099657      2107576 +0.4%        2100226 +0.0%
lkp-skl-d01      1476835      1476358 -0.0%        1474487 -0.2%
lkp-hsw-d01      1308810      1303429 -0.4%        1301299 -0.6%
lkp-sb02          589286       589284 -0.0%         588101 -0.2%

will-it-scale.malloc1.threads (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     21289         21125     -0.8%      21241     -0.2%
lkp-bdw-ex1      28114         28089     -0.1%      28007     -0.4%
lkp-skl-2sp2     91866         91946     +0.1%      92723     +0.9%
lkp-bdw-ep2      37637         37501     -0.4%      37317     -0.9%
lkp-hsw-ep2      43673         43590     -0.2%      43754     +0.2%
lkp-wsm-ep2      28577         28298     -1.0%      28545     -0.1%
lkp-skl-d01     175277        173343     -1.1%     173082     -1.3%
lkp-hsw-d01     130303        129566     -0.6%     129250     -0.8%
lkp-sb02        113742±3%     116911     +2.8%     116417±3%  +2.4%

will-it-scale.malloc2.processes (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.206e+09     1.206e+09 -0.0%      1.206e+09 +0.0%
lkp-bdw-ex1     1.319e+09     1.319e+09 -0.0%      1.319e+09 +0.0%
lkp-skl-2sp2    8.000e+08     8.021e+08 +0.3%      7.995e+08 -0.1%
lkp-bdw-ep2     6.582e+08     6.634e+08 +0.8%      6.513e+08 -1.1%
lkp-hsw-ep2     6.671e+08     6.669e+08 -0.0%      6.665e+08 -0.1%
lkp-wsm-ep2     1.805e+08     1.806e+08 +0.0%      1.804e+08 -0.1%
lkp-skl-d01     1.611e+08     1.611e+08 -0.0%      1.610e+08 -0.0%
lkp-hsw-d01     1.333e+08     1.332e+08 -0.0%      1.332e+08 -0.0%
lkp-sb02         82485104      82478206 -0.0%       82473546 -0.0%

will-it-scale.malloc2.threads (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.574e+09     1.574e+09 -0.0%      1.574e+09 -0.0%
lkp-bdw-ex1     1.737e+09     1.737e+09 +0.0%      1.737e+09 -0.0%
lkp-skl-2sp2    9.161e+08     9.162e+08 +0.0%      9.181e+08 +0.2%
lkp-bdw-ep2     7.856e+08     8.015e+08 +2.0%      8.113e+08 +3.3%
lkp-hsw-ep2     6.908e+08     6.904e+08 -0.1%      6.907e+08 -0.0%
lkp-wsm-ep2     2.409e+08     2.409e+08 +0.0%      2.409e+08 -0.0%
lkp-skl-d01     1.199e+08     1.199e+08 -0.0%      1.199e+08 -0.0%
lkp-hsw-d01     1.029e+08     1.029e+08 -0.0%      1.029e+08 +0.0%
lkp-sb02         68081213      68061423 -0.0%       68076037 -0.0%

will-it-scale.page_fault2.processes (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    14509125±4%   16472364 +13.5%       17123117 +18.0%
lkp-bdw-ex1     14736381      16196588  +9.9%       16364011 +11.0%
lkp-skl-2sp2     6354925       6435444  +1.3%        6436644  +1.3%
lkp-bdw-ep2      8749584       8834422  +1.0%        8827179  +0.9%
lkp-hsw-ep2      8762591       8845920  +1.0%        8825697  +0.7%
lkp-wsm-ep2      3036083       3030428  -0.2%        3021741  -0.5%
lkp-skl-d01      2307834       2304731  -0.1%        2286142  -0.9%
lkp-hsw-d01      1806237       1800786  -0.3%        1795943  -0.6%
lkp-sb02          842616        837844  -0.6%         833921  -1.0%

will-it-scale.page_fault2.threads

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1623294       1615132±2% -0.5%     1656777    +2.1%
lkp-bdw-ex1      1995714       2025948    +1.5%     2113753±3% +5.9%
lkp-skl-2sp2     2346708       2415591    +2.9%     2416919    +3.0%
lkp-bdw-ep2      2342564       2344882    +0.1%     2300206    -1.8%
lkp-hsw-ep2      1820658       1831681    +0.6%     1844057    +1.3%
lkp-wsm-ep2      1725482       1733774    +0.5%     1740517    +0.9%
lkp-skl-d01      1832833       1823628    -0.5%     1806489    -1.4%
lkp-hsw-d01      1427913       1427287    -0.0%     1420226    -0.5%
lkp-sb02          750626        748615    -0.3%      746621    -0.5%

will-it-scale.page_fault3.processes (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    24382726      24400317 +0.1%       24668774 +1.2%
lkp-bdw-ex1     35399750      35683124 +0.8%       35829492 +1.2%
lkp-skl-2sp2    28136820      28068248 -0.2%       28147989 +0.0%
lkp-bdw-ep2     37269077      37459490 +0.5%       37373073 +0.3%
lkp-hsw-ep2     36224967      36114085 -0.3%       36104908 -0.3%
lkp-wsm-ep2     16820457      16911005 +0.5%       16968596 +0.9%
lkp-skl-d01      7721138       7725904 +0.1%        7756740 +0.5%
lkp-hsw-d01      7611979       7650928 +0.5%        7651323 +0.5%
lkp-sb02         3781546       3796502 +0.4%        3796827 +0.4%

will-it-scale.page_fault3.threads (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1865820±3%   1900917±2%  +1.9%     1826245±4%  -2.1%
lkp-bdw-ex1      3094060      3148326     +1.8%     3150036     +1.8%
lkp-skl-2sp2     3952940      3953898     +0.0%     3989360     +0.9%
lkp-bdw-ep2      3420373±3%   3643964     +6.5%     3644910±5%  +6.6%
lkp-hsw-ep2      2609635±2%   2582310±3%  -1.0%     2780459     +6.5%
lkp-wsm-ep2      4395001      4417196     +0.5%     4432499     +0.9%
lkp-skl-d01      5363977      5400003     +0.7%     5411370     +0.9%
lkp-hsw-d01      5274131      5311294     +0.7%     5319359     +0.9%
lkp-sb02         2917314      2913004     -0.1%     2935286     +0.6%

will-it-scale.read1.processes (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    73762279±14%  69322519±10% -6.0%    69349855±13%  -6.0% (result unstable)
lkp-bdw-ex1     1.701e+08     1.704e+08    +0.1%    1.705e+08     +0.2%
lkp-skl-2sp2    63111570      63113953     +0.0%    63836573      +1.1%
lkp-bdw-ep2     79247409      79424610     +0.2%    78012656      -1.6%
lkp-hsw-ep2     67677026      68308800     +0.9%    67539106      -0.2%
lkp-wsm-ep2     13339630      13939817     +4.5%    13766865      +3.2%
lkp-skl-d01     10969487      10972650     +0.0%    no data
lkp-hsw-d01     9857342±2%    10080592±2%  +2.3%    10131560      +2.8%
lkp-sb02        5189076        5197473     +0.2%    5163253       -0.5%

will-it-scale.read1.threads (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    62468045±12%  73666726±7% +17.9%    79553123±12% +27.4% (result unstable)
lkp-bdw-ex1     1.62e+08      1.624e+08    +0.3%    1.614e+08     -0.3%
lkp-skl-2sp2    58319780      59181032     +1.5%    59821353      +2.6%
lkp-bdw-ep2     74057992      75698171     +2.2%    74990869      +1.3%
lkp-hsw-ep2     63672959      63639652     -0.1%    64387051      +1.1%
lkp-wsm-ep2     13489943      13526058     +0.3%    13259032      -1.7%
lkp-skl-d01     10297906      10338796     +0.4%    10407328      +1.1%
lkp-hsw-d01      9636721       9667376     +0.3%     9341147      -3.1%
lkp-sb02         4801938       4804496     +0.1%     4802290      +0.0%

will-it-scale.write1.processes (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.111e+08     1.104e+08±2%  -0.7%   1.122e+08±2%  +1.0%
lkp-bdw-ex1     1.392e+08     1.399e+08     +0.5%   1.397e+08     +0.4%
lkp-skl-2sp2     59369233      58994841     -0.6%    58715168     -1.1%
lkp-bdw-ep2      61820979      CPU throttle          63593123     +2.9%
lkp-hsw-ep2      57897587      57435605     -0.8%    56347450     -2.7%
lkp-wsm-ep2       7814203       7918017±2%  +1.3%     7669068     -1.9%
lkp-skl-d01       8886557       8971422     +1.0%     8818366     -0.8%
lkp-hsw-d01       9171001±5%    9189915     +0.2%     9483909     +3.4%
lkp-sb02          4475406       4475294     -0.0%     4501756     +0.6%

will-it-scale.write1.threads (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.058e+08     1.055e+08±2%  -0.2%   1.065e+08  +0.7%
lkp-bdw-ex1     1.316e+08     1.300e+08     -1.2%   1.308e+08  -0.6%
lkp-skl-2sp2     54492421      56086678     +2.9%    55975657  +2.7%
lkp-bdw-ep2      59360449      59003957     -0.6%    58101262  -2.1%
lkp-hsw-ep2      53346346±2%   52530876     -1.5%    52902487  -0.8%
lkp-wsm-ep2       7774006       7800092±2%  +0.3%     7558833  -2.8%
lkp-skl-d01       8346174       8235695     -1.3%     no data
lkp-hsw-d01       8636244       8655731     +0.2%     8658868  +0.3%
lkp-sb02          4181820       4204107     +0.5%     4182992  +0.0%

vm-scalability.anon-r-rand.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    11933873±3%   12356544±2%  +3.5%   12188624     +2.1%
lkp-bdw-ex1      7114424±2%    7330949±2%  +3.0%    7392419     +3.9%
lkp-skl-2sp2     6773277±5%    6492332±8%  -4.1%    6543962     -3.4%
lkp-bdw-ep2      7133846±4%    7233508     +1.4%    7013518±3%  -1.7%
lkp-hsw-ep2      4576626       4527098     -1.1%    4551679     -0.5%
lkp-wsm-ep2      2583599       2592492     +0.3%    2588039     +0.2%
lkp-hsw-d01       998199±2%    1028311     +3.0%    1006460±2%  +0.8%
lkp-sb02          570572        567854     -0.5%     568449     -0.4%

vm-scalability.anon-r-rand-mt.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1789419       1787830     -0.1%    1788208     -0.1%
lkp-bdw-ex1      3492595±2%    3554966±2%  +1.8%    3558835±3%  +1.9%
lkp-skl-2sp2     3856238±2%    3975403±4%  +3.1%    3994600     +3.6%
lkp-bdw-ep2      3726963±11%   3809292±6%  +2.2%    3871924±4%  +3.9%
lkp-hsw-ep2      2131760±3%    2033578±4%  -4.6%    2130727±6%  -0.0%
lkp-wsm-ep2      2369731       2368384     -0.1%    2370252     +0.0%
lkp-skl-d01      1207128       1206220     -0.1%    1205801     -0.1%
lkp-hsw-d01       964317        992329±2%  +2.9%     992099±2%  +2.9%
lkp-sb02          567137        567346     +0.0%     566144     -0.2%

vm-scalability.lru-file-mmap-read.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    19560469±6%   23018999     +17.7%   23418800     +19.7%
lkp-bdw-ex1     17769135±14%  26141676±3%  +47.1%   26284723±5%  +47.9%
lkp-skl-2sp2    14056512      13578884      -3.4%   13146214      -6.5%
lkp-bdw-ep2     15336542      14737654      -3.9%   14088159      -8.1%
lkp-hsw-ep2     16275498      15756296      -3.2%   15018090      -7.7%
lkp-wsm-ep2     11272160      11237231      -0.3%   11310047      +0.3%
lkp-skl-d01      7322119       7324569      +0.0%    7184148      -1.9%
lkp-hsw-d01      6449234       6404542      -0.7%    6356141      -1.4%
lkp-sb02         3517943       3520668      +0.1%    3527309      +0.3%

vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1689052       1697553  +0.5%       1698726  +0.6%
lkp-bdw-ex1      1675246       1699764  +1.5%       1712226  +2.2%
lkp-skl-2sp2     1800533       1799749  -0.0%       1800581  +0.0%
lkp-bdw-ep2      1807422       1807758  +0.0%       1804932  -0.1%
lkp-hsw-ep2      1809807       1808781  -0.1%       1807811  -0.1%
lkp-wsm-ep2      1800198       1802434  +0.1%       1801236  +0.1%
lkp-skl-d01       696689        695537  -0.2%        694106  -0.4%
lkp-hsw-d01       698364        698666  +0.0%        696686  -0.2%
lkp-sb02          258939        258787  -0.1%        258199  -0.3%

Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Michal Hocko
9ea9a68064 mm: drop VM_BUG_ON from __get_free_pages
There is no real reason to blow up just because the caller doesn't know
that __get_free_pages cannot return highmem pages.  Simply fix that up
silently.  Even if we have some confused users such a fixup will not be
harmful.

[akpm@linux-foundation.org: mask off __GFP_HIGHMEM]
Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jiankang Chen <chenjiankang1@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Vlastimil Babka
d6a24df006 mm, page_alloc: actually ignore mempolicies for high priority allocations
__alloc_pages_slowpath() has for a long time contained code to ignore
node restrictions from memory policies for high priority allocations.
The current code that resets the zonelist iterator however does
effectively nothing after commit 7810e6781e ("mm, page_alloc: do not
break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset.
Even before that commit, mempolicy restrictions were still not ignored,
as they are passed in ac->nodemask which is untouched by the code.

We can either remove the code, or make it work as intended.  Since
ac->nodemask can be set from task's mempolicy via alloc_pages_current()
and thus also alloc_pages(), it may indeed affect kernel allocations,
and it makes sense to ignore it to allow progress for high priority
allocations.

Thus, this patch resets ac->nodemask to NULL in such cases.  This
assumes all callers can handle it (i.e.  there are no guarantees as in
the case of __GFP_THISNODE) which seems to be the case.  The same
assumption is already present in check_retry_cpuset() for some time.

The expected effect is that high priority kernel allocations in the
context of userspace tasks (e.g.  OOM victims) restricted by mempolicies
will have higher chance to succeed if they are restricted to nodes with
depleted memory, while there are other nodes with free memory left.

It's not a new intention, but for the first time the code will match the
intention, AFAICS.  It was intended by commit 183f6371aa ("mm: ignore
mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never
really worked, as mempolicy restriction was already encoded in nodemask,
not zonelist, at that time.

So originally that was for ALLOC_NO_WATERMARK only.  Then it was
adjusted by e46e7b77c9 ("mm, page_alloc: recalculate the preferred
zoneref if the context can ignore memory policies") and cd04ae1e2d
("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the
current state.  So even GFP_ATOMIC would now ignore mempolicies after
the initial attempts fail - if the code worked as people thought it
does.

Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Pavel Tatashin
720e14ebec mm: skip invalid pages block at a time in zero_resv_unresv()
The role of zero_resv_unavail() is to make sure that every struct page
that is allocated but is not backed by memory that is accessible by
kernel is zeroed and not in some uninitialized state.

Since struct pages are allocated in blocks (2M pages in x86 case), we
can skip pageblock_nr_pages at a time, when the first one is found to be
invalid.

This optimization may help since now on x86 every hole in e820 maps is
marked as reserved in memblock, and thus will go through this function.

This function is called before sched_clock() is initialized, so I used
my x86 early boot clock patches to measure the performance improvement.

With 1T hole on i7-8700 currently we would take 0.606918s of boot time,
but with this optimization 0.001103s.

Link: http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Linus Torvalds
b018fc9800 Merge tag 'pm-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "These add a new framework for CPU idle time injection, to be used by
  all of the idle injection code in the kernel in the future, fix some
  issues and add a number of relatively small extensions in multiple
  places.

  Specifics:

   - Add a new framework for CPU idle time injection (Daniel Lezcano).

   - Add AVS support to the armada-37xx cpufreq driver (Gregory
     CLEMENT).

   - Add support for current CPU frequency reporting to the ACPI CPPC
     cpufreq driver (George Cherian).

   - Rework the cooling device registration in the imx6q/thermal driver
     (Bastian Stender).

   - Make the pcc-cpufreq driver refuse to work with dynamic scaling
     governors on systems with many CPUs to avoid scalability issues
     with it (Rafael Wysocki).

   - Fix the intel_pstate driver to report different maximum CPU
     frequencies on systems where they really are different and to
     ignore the turbo active ratio if hardware-managend P-states (HWP)
     are in use; make it use the match_string() helper (Xie Yisheng,
     Srinivas Pandruvada).

   - Fix a minor deferred probe issue in the qcom-kryo cpufreq driver
     (Niklas Cassel).

   - Add a tracepoint for the tracking of frequency limits changes (from
     Andriod) to the cpufreq core (Ruchi Kandoi).

   - Fix a circular lock dependency between CPU hotplug and sysfs
     locking in the cpufreq core reported by lockdep (Waiman Long).

   - Avoid excessive error reports on driver registration failures in
     the ARM cpuidle driver (Sudeep Holla).

   - Add a new device links flag to the driver core to make links go
     away automatically on supplier driver removal (Vivek Gautam).

   - Eliminate potential race condition between system-wide power
     management transitions and system shutdown (Pingfan Liu).

   - Add a quirk to save NVS memory on system suspend for the ASUS 1025C
     laptop (Willy Tarreau).

   - Make more systems use suspend-to-idle (instead of ACPI S3) by
     default (Tristian Celestin).

   - Get rid of stack VLA usage in the low-level hibernation code on
     64-bit x86 (Kees Cook).

   - Fix error handling in the hibernation core and mark an expected
     fall-through switch in it (Chengguang Xu, Gustavo Silva).

   - Extend the generic power domains (genpd) framework to support
     attaching a device to a power domain by name (Ulf Hansson).

   - Fix device reference counting and user limits initialization in the
     devfreq core (Arvind Yadav, Matthias Kaehlcke).

   - Fix a few issues in the rk3399_dmc devfreq driver and improve its
     documentation (Enric Balletbo i Serra, Lin Huang, Nick Milner).

   - Drop a redundant error message from the exynos-ppmu devfreq driver
     (Markus Elfring)"

* tag 'pm-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits)
  PM / reboot: Eliminate race between reboot and suspend
  PM / hibernate: Mark expected switch fall-through
  cpufreq: intel_pstate: Ignore turbo active ratio in HWP
  cpufreq: Fix a circular lock dependency problem
  cpu/hotplug: Add a cpus_read_trylock() function
  x86/power/hibernate_64: Remove VLA usage
  cpufreq: trace frequency limits change
  cpufreq: intel_pstate: Show different max frequency with turbo 3 and HWP
  cpufreq: pcc-cpufreq: Disable dynamic scaling on many-CPU systems
  cpufreq: qcom-kryo: Silently error out on EPROBE_DEFER
  cpufreq / CPPC: Add cpuinfo_cur_freq support for CPPC
  cpufreq: armada-37xx: Add AVS support
  dt-bindings: marvell: Add documentation for the Armada 3700 AVS binding
  PM / devfreq: rk3399_dmc: Fix duplicated opp table on reload.
  PM / devfreq: Init user limits from OPP limits, not viceversa
  PM / devfreq: rk3399_dmc: fix spelling mistakes.
  PM / devfreq: rk3399_dmc: do not print error when get supply and clk defer.
  dt-bindings: devfreq: rk3399_dmc: move interrupts to be optional.
  PM / devfreq: rk3399_dmc: remove wait for dcf irq event.
  dt-bindings: clock: add rk3399 DDR3 standard speed bins.
  ...
2018-08-14 13:12:24 -07:00
Rafael J. Wysocki
17bc3432e3 Merge branches 'pm-core', 'pm-domains', 'pm-sleep', 'acpi-pm' and 'pm-cpuidle'
Merge changes in the PM core, system-wide PM infrastructure, generic
power domains (genpd) framework, ACPI PM infrastructure and cpuidle
for 4.19.

* pm-core:
  driver core: Add flag to autoremove device link on supplier unbind
  driver core: Rename flag AUTOREMOVE to AUTOREMOVE_CONSUMER

* pm-domains:
  PM / Domains: Introduce dev_pm_domain_attach_by_name()
  PM / Domains: Introduce option to attach a device by name to genpd
  PM / Domains: dt: Add a power-domain-names property

* pm-sleep:
  PM / reboot: Eliminate race between reboot and suspend
  PM / hibernate: Mark expected switch fall-through
  x86/power/hibernate_64: Remove VLA usage
  PM / hibernate: cast PAGE_SIZE to int when comparing with error code

* acpi-pm:
  ACPI / PM: save NVS memory for ASUS 1025C laptop
  ACPI / PM: Default to s2idle in all machines supporting LP S0

* pm-cpuidle:
  ARM: cpuidle: silence error on driver registration failure
2018-08-14 09:48:10 +02:00
Pingfan Liu
55f2503c3b PM / reboot: Eliminate race between reboot and suspend
At present, "systemctl suspend" and "shutdown" can run in parrallel. A
system can suspend after devices_shutdown(), and resume. Then the shutdown
task goes on to power off. This causes many devices are not really shut
off. Hence replacing reboot_mutex with system_transition_mutex (renamed
from pm_mutex) to achieve the exclusion. The renaming of pm_mutex as
system_transition_mutex can be better to reflect the purpose of the mutex.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-08-06 12:35:20 +02:00
Dave Hansen
0d83432811 mm: Allow non-direct-map arguments to free_reserved_area()
free_reserved_area() takes pointers as arguments to show which addresses
should be freed.  However, it does this in a somewhat ambiguous way.  If it
gets a kernel direct map address, it always works.  However, if it gets an
address that is part of the kernel image alias mapping, it can fail.

It fails if all of the following happen:
 * The specified address is part of the kernel image alias
 * Poisoning is requested (forcing a memset())
 * The address is in a read-only portion of the kernel image

The memset() fails on the read-only mapping, of course.
free_reserved_area() *is* called both on the direct map and on kernel image
alias addresses.  We've just lucked out thus far that the kernel image
alias areas it gets used on are read-write.  I'm fairly sure this has been
just a happy accident.

It is quite easy to make free_reserved_area() work for all cases: just
convert the address to a direct map address before doing the memset(), and
do this unconditionally.  There is little chance of a regression here
because we previously did a virt_to_page() on the address for the memset,
so we know these are not highmem pages for which virt_to_page() would fail.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@google.com
Cc: aarcange@redhat.com
Cc: jgross@suse.com
Cc: jpoimboe@redhat.com
Cc: gregkh@linuxfoundation.org
Cc: peterz@infradead.org
Cc: hughd@google.com
Cc: torvalds@linux-foundation.org
Cc: bp@alien8.de
Cc: luto@kernel.org
Cc: ak@linux.intel.com
Cc: Kees Cook <keescook@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/20180802225826.1287AE3E@viggo.jf.intel.com
2018-08-05 22:21:02 +02:00
Pavel Tatashin
d1b47a7c9e mm: don't do zero_resv_unavail if memmap is not allocated
Moving zero_resv_unavail before memmap_init_zone(), caused a regression on
x86-32.

The cause is that we access struct pages before they are allocated when
CONFIG_FLAT_NODE_MEM_MAP is used.

free_area_init_nodes()
  zero_resv_unavail()
    mm_zero_struct_page(pfn_to_page(pfn)); <- struct page is not alloced
  free_area_init_node()
    if CONFIG_FLAT_NODE_MEM_MAP
      alloc_node_mem_map()
        memblock_virt_alloc_node_nopanic() <- struct page alloced here

On the other hand memblock_virt_alloc_node_nopanic() zeroes all the memory
that it returns, so we do not need to do zero_resv_unavail() here.

Fixes: e181ae0c5d ("mm: zero unavailable pages before memmap init")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Matt Hart <matt@mattface.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-16 09:41:57 -07:00
Pavel Tatashin
e181ae0c5d mm: zero unavailable pages before memmap init
We must zero struct pages for memory that is not backed by physical
memory, or kernel does not have access to.

Recently, there was a change which zeroed all memmap for all holes in
e820.  Unfortunately, it introduced a bug that is discussed here:

  https://www.spinics.net/lists/linux-mm/msg156764.html

Linus, also saw this bug on his machine, and confirmed that reverting
commit 124049decb ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved") fixes the issue.

The problem is that we incorrectly zero some struct pages after they
were setup.

The fix is to zero unavailable struct pages prior to initializing of
struct pages.

A more detailed fix should come later that would avoid double zeroing
cases: one in __init_single_page(), the other one in
zero_resv_unavail().

Fixes: 124049decb ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-14 11:02:20 -07:00
Joe Perches
0825a6f986 mm: use octal not symbolic permissions
mm/*.c files use symbolic and octal styles for permissions.

Using octal and not symbolic permissions is preferred by many as more
readable.

https://lkml.org/lkml/2016/8/2/1945

Prefer the direct use of octal for permissions.

Done using
$ scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace mm/*.c
and some typing.

Before:	 $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
44
After:	 $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
86

Miscellanea:

o Whitespace neatening around these conversions.

Link: http://lkml.kernel.org/r/2e032ef111eebcd4c5952bae86763b541d373469.1522102887.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-15 07:55:25 +09:00
Vlastimil Babka
7810e6781e mm, page_alloc: do not break __GFP_THISNODE by zonelist reset
In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
allocations that can ignore memory policies.  The zonelist is obtained
from current CPU's node.  This is a problem for __GFP_THISNODE
allocations that want to allocate on a different node, e.g.  because the
allocating thread has been migrated to a different CPU.

This has been observed to break SLAB in our 4.4-based kernel, because
there it relies on __GFP_THISNODE working as intended.  If a slab page
is put on wrong node's list, then further list manipulations may corrupt
the list because page_to_nid() is used to determine which node's
list_lock should be locked and thus we may take a wrong lock and race.

Current SLAB implementation seems to be immune by luck thanks to commit
511e3a0588 ("mm/slab: make cache_grow() handle the page allocated on
arbitrary node") but there may be others assuming that __GFP_THISNODE
works as promised.

We can fix it by simply removing the zonelist reset completely.  There
is actually no reason to reset it, because memory policies and cpusets
don't affect the zonelist choice in the first place.  This was different
when commit 183f6371aa ("mm: ignore mempolicies when using
ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
own restricted zonelists.

We might consider this for 4.17 although I don't know if there's
anything currently broken.

SLAB is currently not affected, but in kernels older than 4.7 that don't
yet have 511e3a0588 ("mm/slab: make cache_grow() handle the page
allocated on arbitrary node") it is.  That's at least 4.4 LTS.  Older
ones I'll have to check.

So stable backports should be more important, but will have to be
reviewed carefully, as the code went through many changes.  BTW I think
that also the ac->preferred_zoneref reset is currently useless if we
don't also reset ac->nodemask from a mempolicy to NULL first (which we
probably should for the OOM victims etc?), but I would leave that for a
separate patch.

Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 183f6371aa ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:38 -07:00