Commit Graph

744 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
24a799db09 Merge 4.19.297 into android-4.19-stable
Changes in 4.19.297
	indirect call wrappers: helpers to speed-up indirect calls of builtin
	net: use indirect calls helpers at the socket layer
	net: fix kernel-doc warnings for socket.c
	net: prevent rewrite of msg_name in sock_sendmsg()
	RDMA/cxgb4: Check skb value for failure to allocate
	HID: logitech-hidpp: Fix kernel crash on receiver USB disconnect
	quota: Fix slow quotaoff
	net: prevent address rewrite in kernel_bind()
	drm: etvnaviv: fix bad backport leading to warning
	drm/msm/dsi: skip the wait for video mode done if not applicable
	ieee802154: ca8210: Fix a potential UAF in ca8210_probe
	xen-netback: use default TX queue size for vifs
	drm/vmwgfx: fix typo of sizeof argument
	ixgbe: fix crash with empty VF macvlan list
	net: nfc: fix races in nfc_llcp_sock_get() and nfc_llcp_sock_get_sn()
	nfc: nci: assert requested protocol is valid
	workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()
	sched,idle,rcu: Push rcu_idle deeper into the idle path
	dmaengine: stm32-mdma: abort resume if no ongoing transfer
	usb: xhci: xhci-ring: Use sysdev for mapping bounce buffer
	net: usb: dm9601: fix uninitialized variable use in dm9601_mdio_read
	usb: dwc3: Soft reset phy on probe for host
	usb: musb: Get the musb_qh poniter after musb_giveback
	usb: musb: Modify the "HWVers" register address
	iio: pressure: bmp280: Fix NULL pointer exception
	iio: pressure: ms5611: ms5611_prom_is_valid false negative bug
	mcb: remove is_added flag from mcb_device struct
	ceph: fix incorrect revoked caps assert in ceph_fill_file_size()
	Input: powermate - fix use-after-free in powermate_config_complete
	Input: psmouse - fix fast_reconnect function for PS/2 mode
	Input: xpad - add PXN V900 support
	cgroup: Remove duplicates in cgroup v1 tasks file
	pinctrl: avoid unsafe code pattern in find_pinctrl()
	x86/cpu: Fix AMD erratum #1485 on Zen4-based CPUs
	usb: gadget: udc-xilinx: replace memcpy with memcpy_toio
	usb: gadget: ncm: Handle decoding of multiple NTB's in unwrap call
	powerpc/64e: Fix wrong test in __ptep_test_and_clear_young()
	x86/alternatives: Disable KASAN in apply_alternatives()
	dev_forward_skb: do not scrub skb mark within the same name space
	usb: hub: Guard against accesses to uninitialized BOS descriptors
	Bluetooth: hci_event: Ignore NULL link key
	Bluetooth: Reject connection with the device which has same BD_ADDR
	Bluetooth: Fix a refcnt underflow problem for hci_conn
	Bluetooth: vhci: Fix race when opening vhci device
	Bluetooth: hci_event: Fix coding style
	Bluetooth: avoid memcmp() out of bounds warning
	nfc: nci: fix possible NULL pointer dereference in send_acknowledge()
	regmap: fix NULL deref on lookup
	KVM: x86: Mask LVTPC when handling a PMI
	netfilter: nft_payload: fix wrong mac header matching
	xfrm: fix a data-race in xfrm_gen_index()
	xfrm: interface: use DEV_STATS_INC()
	net: ipv4: fix return value check in esp_remove_trailer
	net: ipv6: fix return value check in esp_remove_trailer
	net: rfkill: gpio: prevent value glitch during probe
	tcp: fix excessive TLP and RACK timeouts from HZ rounding
	tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a single skb
	net: usb: smsc95xx: Fix an error code in smsc95xx_reset()
	i40e: prevent crash on probe if hw registers have invalid values
	net/sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curve
	netfilter: nft_set_rbtree: .deactivate fails if element has expired
	net: pktgen: Fix interface flags printing
	libceph: fix unaligned accesses in ceph_entity_addr handling
	libceph: use kernel_connect()
	ARM: dts: ti: omap: Fix noisy serial with overrun-throttle-ms for mapphone
	btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1
	btrfs: initialize start_slot in btrfs_log_prealloc_extents
	i2c: mux: Avoid potential false error message in i2c_mux_add_adapter
	overlayfs: set ctime when setting mtime and atime
	gpio: timberdale: Fix potential deadlock on &tgpio->lock
	ata: libata-eh: Fix compilation warning in ata_eh_link_report()
	tracing: relax trace_event_eval_update() execution with cond_resched()
	HID: holtek: fix slab-out-of-bounds Write in holtek_kbd_input_event
	Bluetooth: Avoid redundant authentication
	Bluetooth: hci_core: Fix build warnings
	wifi: mac80211: allow transmitting EAPOL frames with tainted key
	wifi: cfg80211: avoid leaking stack data into trace
	sky2: Make sure there is at least one frag_addr available
	drm: panel-orientation-quirks: Add quirk for One Mix 2S
	btrfs: fix some -Wmaybe-uninitialized warnings in ioctl.c
	Bluetooth: hci_event: Fix using memcmp when comparing keys
	mtd: rawnand: qcom: Unmap the right resource upon probe failure
	mtd: spinand: micron: correct bitmask for ecc status
	mmc: core: Capture correct oemid-bits for eMMC cards
	Revert "pinctrl: avoid unsafe code pattern in find_pinctrl()"
	ACPI: irq: Fix incorrect return value in acpi_register_gsi()
	USB: serial: option: add Telit LE910C4-WWX 0x1035 composition
	USB: serial: option: add entry for Sierra EM9191 with new firmware
	USB: serial: option: add Fibocom to DELL custom modem FM101R-GL
	perf: Disallow mis-matched inherited group reads
	s390/pci: fix iommu bitmap allocation
	gpio: vf610: set value before the direction to avoid a glitch
	ASoC: pxa: fix a memory leak in probe()
	phy: mapphone-mdm6600: Fix runtime PM for remove
	Bluetooth: hci_sock: fix slab oob read in create_monitor_event
	Bluetooth: hci_sock: Correctly bounds check and pad HCI_MON_NEW_INDEX name
	xfrm6: fix inet6_dev refcount underflow problem
	Linux 4.19.297

Change-Id: I495e8b8fbb6416ec3f94094fa905bdde364618b4
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-10-25 11:43:43 +00:00
Neal Cardwell
6d022a7abf tcp: fix excessive TLP and RACK timeouts from HZ rounding
commit 1c2709cfff1dedbb9591e989e2f001484208d914 upstream.

We discovered from packet traces of slow loss recovery on kernels with
the default HZ=250 setting (and min_rtt < 1ms) that after reordering,
when receiving a SACKed sequence range, the RACK reordering timer was
firing after about 16ms rather than the desired value of roughly
min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer
calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels
with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the
exact same issue.

This commit fixes the TLP transmit timer and RACK reordering timer
floor calculation to more closely match the intended 2ms floor even on
kernels with HZ=250. It does this by adding in a new
TCP_TIMEOUT_MIN_US floor of 2000 us and then converting to jiffies,
instead of the current approach of converting to jiffies and then
adding th TCP_TIMEOUT_MIN value of 2 jiffies.

Our testing has verified that on kernels with HZ=1000, as expected,
this does not produce significant changes in behavior, but on kernels
with the default HZ=250 the latency improvement can be large. For
example, our tests show that for HZ=250 kernels at low RTTs this fix
roughly halves the latency for the RACK reorder timer: instead of
mostly firing at 16ms it mostly fires at 8ms.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: bb4d991a28 ("tcp: adjust tail loss probe timeout")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231015174700.2206872-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25 11:16:46 +02:00
Greg Kroah-Hartman
7f2e810705 Merge 4.19.296 into android-4.19-stable
Changes in 4.19.296
	NFS/pNFS: Report EINVAL errors from connect() to the server
	ata: ahci: Drop pointless VPRINTK() calls and convert the remaining ones
	ata: libahci: clear pending interrupt status
	netfilter: nf_tables: disallow element removal on anonymous sets
	selftests/tls: Add {} to avoid static checker warning
	selftests: tls: swap the TX and RX sockets in some tests
	ipv4: fix null-deref in ipv4_link_failure
	powerpc/perf/hv-24x7: Update domain value check
	net: hns3: add 5ms delay before clear firmware reset irq source
	net: add atomic_long_t to net_device_stats fields
	net: bridge: use DEV_STATS_INC()
	team: fix null-ptr-deref when team device type is changed
	gpio: tb10x: Fix an error handling path in tb10x_gpio_probe()
	i2c: mux: demux-pinctrl: check the return value of devm_kstrdup()
	Input: i8042 - add quirk for TUXEDO Gemini 17 Gen1/Clevo PD70PN
	scsi: qla2xxx: Add protection mask module parameters
	scsi: qla2xxx: Remove unsupported ql2xenabledif option
	scsi: megaraid_sas: Load balance completions across all MSI-X
	scsi: megaraid_sas: Fix deadlock on firmware crashdump
	ext4: remove the 'group' parameter of ext4_trim_extent
	ext4: add new helper interface ext4_try_to_trim_range()
	ext4: scope ret locally in ext4_try_to_trim_range()
	ext4: change s_last_trim_minblks type to unsigned long
	ext4: mark group as trimmed only if it was fully scanned
	ext4: replace the traditional ternary conditional operator with with max()/min()
	ext4: move setting of trimmed bit into ext4_try_to_trim_range()
	ext4: do not let fstrim block system suspend
	MIPS: Alchemy: only build mmc support helpers if au1xmmc is enabled
	clk: tegra: fix error return case for recalc_rate
	ARM: dts: ti: omap: motorola-mapphone: Fix abe_clkctrl warning on boot
	gpio: pmic-eic-sprd: Add can_sleep flag for PMIC EIC chip
	parisc: sba: Fix compile warning wrt list of SBA devices
	parisc: iosapic.c: Fix sparse warnings
	parisc: drivers: Fix sparse warning
	parisc: irq: Make irq_stack_union static to avoid sparse warning
	selftests/ftrace: Correctly enable event in instance-event.tc
	ring-buffer: Avoid softlockup in ring_buffer_resize()
	ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset()
	bpf: Clarify error expectations from bpf_clone_redirect
	fbdev/sh7760fb: Depend on FB=y
	nvme-pci: do not set the NUMA node of device if it has none
	watchdog: iTCO_wdt: No need to stop the timer in probe
	watchdog: iTCO_wdt: Set NO_REBOOT if the watchdog is not already running
	net: Fix unwanted sign extension in netdev_stats_to_stats64()
	scsi: megaraid_sas: Enable msix_load_balance for Invader and later controllers
	Smack:- Use overlay inode label in smack_inode_copy_up()
	smack: Retrieve transmuting information in smack_inode_getsecurity()
	smack: Record transmuting in smk_transmuted
	serial: 8250_port: Check IRQ data before use
	nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()
	ALSA: hda: Disable power save for solving pop issue on Lenovo ThinkCentre M70q
	ata: libata-scsi: ignore reserved bits for REPORT SUPPORTED OPERATION CODES
	i2c: i801: unregister tco_pdev in i801_probe() error path
	btrfs: properly report 0 avail for very full file systems
	net: thunderbolt: Fix TCPv6 GSO checksum calculation
	ata: libata-core: Fix ata_port_request_pm() locking
	ata: libata-core: Fix port and device removal
	ata: libata-core: Do not register PM operations for SAS ports
	ata: libata-sata: increase PMP SRST timeout to 10s
	fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
	ext4: fix rec_len verify error
	ata: libata: disallow dev-initiated LPM transitions to unsupported states
	Revert "drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions"
	media: dvb: symbol fixup for dvb_attach() - again
	Revert "PCI: qcom: Disable write access to read only registers for IP v2.3.3"
	scsi: zfcp: Fix a double put in zfcp_port_enqueue()
	qed/red_ll2: Fix undefined behavior bug in struct qed_ll2_info
	wifi: mwifiex: Fix tlv_buf_left calculation
	net: replace calls to sock->ops->connect() with kernel_connect()
	ubi: Refuse attaching if mtd's erasesize is 0
	wifi: mwifiex: Fix oob check condition in mwifiex_process_rx_packet
	drivers/net: process the result of hdlc_open() and add call of hdlc_close() in uhdlc_close()
	regmap: rbtree: Fix wrong register marked as in-cache when creating new node
	scsi: target: core: Fix deadlock due to recursive locking
	modpost: add missing else to the "of" check
	ipv4, ipv6: Fix handling of transhdrlen in __ip{,6}_append_data()
	net: usb: smsc75xx: Fix uninit-value access in __smsc75xx_read_reg
	net: stmmac: dwmac-stm32: fix resume on STM32 MCU
	tcp: fix quick-ack counting to count actual ACKs of new data
	tcp: fix delayed ACKs for MSS boundary condition
	sctp: update transport state when processing a dupcook packet
	sctp: update hb timer immediately after users change hb_interval
	cpupower: add Makefile dependencies for install targets
	IB/mlx4: Fix the size of a buffer in add_port_entries()
	gpio: aspeed: fix the GPIO number passed to pinctrl_gpio_set_config()
	gpio: pxa: disable pinctrl calls for MMP_GPIO
	RDMA/cma: Fix truncation compilation warning in make_cma_ports
	RDMA/mlx5: Fix NULL string error
	parisc: Restore __ldcw_align for PA-RISC 2.0 processors
	dccp: fix dccp_v4_err()/dccp_v6_err() again
	Revert "rtnetlink: Reject negative ifindexes in RTM_NEWLINK"
	rtnetlink: Reject negative ifindexes in RTM_NEWLINK
	xen/events: replace evtchn_rwlock with RCU
	Linux 4.19.296

Change-Id: I4f76959ead91691fe9a242cb5b158fc8dc67bf39
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-10-11 19:17:39 +00:00
Neal Cardwell
b86bfa8334 tcp: fix quick-ack counting to count actual ACKs of new data
[ Upstream commit 059217c18be6757b95bfd77ba53fb50b48b8a816 ]

This commit fixes quick-ack counting so that it only considers that a
quick-ack has been provided if we are sending an ACK that newly
acknowledges data.

The code was erroneously using the number of data segments in outgoing
skbs when deciding how many quick-ack credits to remove. This logic
does not make sense, and could cause poor performance in
request-response workloads, like RPC traffic, where requests or
responses can be multi-segment skbs.

When a TCP connection decides to send N quick-acks, that is to
accelerate the cwnd growth of the congestion control module
controlling the remote endpoint of the TCP connection. That quick-ack
decision is purely about the incoming data and outgoing ACKs. It has
nothing to do with the outgoing data or the size of outgoing data.

And in particular, an ACK only serves the intended purpose of allowing
the remote congestion control to grow the congestion window quickly if
the ACK is ACKing or SACKing new data.

The fix is simple: only count packets as serving the goal of the
quickack mechanism if they are ACKing/SACKing new data. We can tell
whether this is the case by checking inet_csk_ack_scheduled(), since
we schedule an ACK exactly when we are ACKing/SACKing new data.

Fixes: fc6415bcb0 ("[TCP]: Fix quick-ack decrementing with TSO.")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10 21:45:01 +02:00
Greg Kroah-Hartman
501b721387 Merge 4.19.295 into android-4.19-stable
Changes in 4.19.295
	erofs: ensure that the post-EOF tails are all zeroed
	ARM: pxa: remove use of symbol_get()
	mmc: au1xmmc: force non-modular build and remove symbol_get usage
	rtc: ds1685: use EXPORT_SYMBOL_GPL for ds1685_rtc_poweroff
	modules: only allow symbol_get of EXPORT_SYMBOL_GPL modules
	USB: serial: option: add Quectel EM05G variant (0x030e)
	USB: serial: option: add FOXCONN T99W368/T99W373 product
	HID: wacom: remove the battery when the EKR is off
	Bluetooth: btsdio: fix use after free bug in btsdio_remove due to race condition
	serial: sc16is7xx: fix bug when first setting GPIO direction
	fsi: master-ast-cf: Add MODULE_FIRMWARE macro
	nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers()
	nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
	pinctrl: amd: Don't show `Invalid config param` errors
	9p: virtio: make sure 'offs' is initialized in zc_request
	ASoC: da7219: Flush pending AAD IRQ when suspending
	ASoC: da7219: Check for failure reading AAD IRQ events
	ethernet: atheros: fix return value check in atl1c_tso_csum()
	vxlan: generalize vxlan_parse_gpe_hdr and remove unused args
	m68k: Fix invalid .section syntax
	s390/dasd: use correct number of retries for ERP requests
	s390/dasd: fix hanging device after request requeue
	fs/nls: make load_nls() take a const parameter
	ASoc: codecs: ES8316: Fix DMIC config
	ASoC: atmel: Fix the 8K sample parameter in I2SC master
	platform/x86: intel: hid: Always call BTNL ACPI method
	security: keys: perform capable check only on privileged operations
	net: usb: qmi_wwan: add Quectel EM05GV2
	idmaengine: make FSL_EDMA and INTEL_IDMA64 depends on HAS_IOMEM
	scsi: qedi: Fix potential deadlock on &qedi_percpu->p_work_lock
	netlabel: fix shift wrapping bug in netlbl_catmap_setlong()
	bnx2x: fix page fault following EEH recovery
	sctp: handle invalid error codes without calling BUG()
	cifs: add a warning when the in-flight count goes negative
	ALSA: seq: oss: Fix racy open/close of MIDI devices
	net: Avoid address overwrite in kernel_connect
	powerpc/32: Include .branch_lt in data section
	powerpc/32s: Fix assembler warning about r0
	udf: Check consistency of Space Bitmap Descriptor
	udf: Handle error when adding extent to a file
	Revert "net: macsec: preserve ingress frame ordering"
	reiserfs: Check the return value from __getblk()
	eventfd: Export eventfd_ctx_do_read()
	eventfd: prevent underflow for eventfd semaphores
	new helper: lookup_positive_unlocked()
	netfilter: nft_flow_offload: fix underflow in flowtable reference counter
	netfilter: nf_tables: missing NFT_TRANS_PREPARE_ERROR in flowtable deactivatation
	fs: Fix error checking for d_hash_and_lookup()
	cpufreq: powernow-k8: Use related_cpus instead of cpus in driver.exit()
	bpf: Clear the probe_addr for uprobe
	tcp: tcp_enter_quickack_mode() should be static
	regmap: rbtree: Use alloc_flags for memory allocations
	spi: tegra20-sflash: fix to check return value of platform_get_irq() in tegra_sflash_probe()
	can: gs_usb: gs_usb_receive_bulk_callback(): count RX overflow errors also in case of OOM
	wifi: mwifiex: Fix OOB and integer underflow when rx packets
	mwifiex: drop 'set_consistent_dma_mask' log message
	mwifiex: switch from 'pci_' to 'dma_' API
	wifi: mwifiex: fix error recovery in PCIE buffer descriptor management
	Bluetooth: nokia: fix value check in nokia_bluetooth_serdev_probe()
	crypto: caam - fix unchecked return value error
	lwt: Check LWTUNNEL_XMIT_CONTINUE strictly
	fs: ocfs2: namei: check return value of ocfs2_add_entry()
	wifi: mwifiex: fix memory leak in mwifiex_histogram_read()
	wifi: mwifiex: Fix missed return in oob checks failed path
	wifi: ath9k: fix races between ath9k_wmi_cmd and ath9k_wmi_ctrl_rx
	wifi: ath9k: protect WMI command response buffer replacement with a lock
	wifi: mwifiex: avoid possible NULL skb pointer dereference
	wifi: ath9k: use IS_ERR() with debugfs_create_dir()
	net: arcnet: Do not call kfree_skb() under local_irq_disable()
	net/sched: sch_hfsc: Ensure inner classes have fsc curve
	netrom: Deny concurrent connect().
	quota: add dqi_dirty_list description to comment of Dquot List Management
	quota: avoid increasing DQST_LOOKUPS when iterating over dirty/inuse list
	quota: factor out dquot_write_dquot()
	quota: rename dquot_active() to inode_quota_active()
	quota: add new helper dquot_active()
	quota: fix dqput() to follow the guarantees dquot_srcu should provide
	arm64: dts: msm8996: thermal: Add interrupt support
	arm64: dts: qcom: msm8996: Add missing interrupt to the USB2 controller
	drm/amdgpu: avoid integer overflow warning in amdgpu_device_resize_fb_bar()
	ARM: dts: BCM5301X: Harmonize EHCI/OHCI DT nodes name
	ARM: dts: BCM53573: Describe on-SoC BCM53125 rev 4 switch
	ARM: dts: BCM53573: Drop nonexistent #usb-cells
	ARM: dts: BCM53573: Add cells sizes to PCIe node
	ARM: dts: BCM53573: Use updated "spi-gpio" binding properties
	ARM: dts: s3c6410: move fixed clocks under root node in Mini6410
	ARM: dts: s3c6410: align node SROM bus node name with dtschema in Mini6410
	ARM: dts: s3c64xx: align pinctrl with dtschema
	ARM: dts: samsung: s3c6410-mini6410: correct ethernet reg addresses (split)
	ARM: dts: s5pv210: add RTC 32 KHz clock in SMDKV210
	ARM: dts: s5pv210: use defines for IRQ flags in SMDKV210
	ARM: dts: s5pv210: correct ethernet unit address in SMDKV210
	ARM: dts: s5pv210: add dummy 5V regulator for backlight on SMDKv210
	ARM: dts: samsung: s5pv210-smdkv210: correct ethernet reg addresses (split)
	drm: adv7511: Fix low refresh rate register for ADV7533/5
	ARM: dts: BCM53573: Fix Ethernet info for Luxul devices
	drm/tegra: Remove superfluous error messages around platform_get_irq()
	drm/tegra: dpaux: Fix incorrect return value of platform_get_irq
	of: unittest: fix null pointer dereferencing in of_unittest_find_node_by_name()
	drm/msm: Replace drm_framebuffer_{un/reference} with put, get functions
	drm/msm/mdp5: Don't leak some plane state
	smackfs: Prevent underflow in smk_set_cipso()
	audit: fix possible soft lockup in __audit_inode_child()
	of: unittest: Fix overlay type in apply/revert check
	ALSA: ac97: Fix possible error value of *rac97
	drivers: clk: keystone: Fix parameter judgment in _of_pll_clk_init()
	clk: sunxi-ng: Modify mismatched function name
	PCI: Mark NVIDIA T4 GPUs to avoid bus reset
	PCI: pciehp: Use RMW accessors for changing LNKCTL
	PCI/ASPM: Use RMW accessors for changing LNKCTL
	PCI/ATS: Add pci_prg_resp_pasid_required() interface.
	PCI: Cleanup register definition width and whitespace
	PCI: Decode PCIe 32 GT/s link speed
	PCI: Add #defines for Enter Compliance, Transmit Margin
	drm/amdgpu: Correct Transmit Margin masks
	drm/amdgpu: Replace numbers with PCI_EXP_LNKCTL2 definitions
	drm/amdgpu: Prefer pcie_capability_read_word()
	drm/amdgpu: Use RMW accessors for changing LNKCTL
	drm/radeon: Correct Transmit Margin masks
	drm/radeon: Replace numbers with PCI_EXP_LNKCTL2 definitions
	drm/radeon: Prefer pcie_capability_read_word()
	drm/radeon: Use RMW accessors for changing LNKCTL
	wifi: ath10k: Use RMW accessors for changing LNKCTL
	nfs/blocklayout: Use the passed in gfp flags
	powerpc/iommu: Fix notifiers being shared by PCI and VIO buses
	jfs: validate max amount of blocks before allocation.
	fs: lockd: avoid possible wrong NULL parameter
	NFSD: da_addr_body field missing in some GETDEVICEINFO replies
	media: Use of_node_name_eq for node name comparisons
	media: v4l2-fwnode: fix v4l2_fwnode_parse_link handling
	media: v4l2-fwnode: simplify v4l2_fwnode_parse_link
	media: v4l2-core: Fix a potential resource leak in v4l2_fwnode_parse_link()
	drivers: usb: smsusb: fix error handling code in smsusb_init_device
	media: dib7000p: Fix potential division by zero
	media: dvb-usb: m920x: Fix a potential memory leak in m920x_i2c_xfer()
	media: cx24120: Add retval check for cx24120_message_send()
	media: mediatek: vcodec: Return NULL if no vdec_fb is found
	usb: phy: mxs: fix getting wrong state with mxs_phy_is_otg_host()
	scsi: iscsi: Add strlen() check in iscsi_if_set{_host}_param()
	scsi: be2iscsi: Add length check when parsing nlattrs
	scsi: qla4xxx: Add length check when parsing nlattrs
	x86/APM: drop the duplicate APM_MINOR_DEV macro
	scsi: qedf: Do not touch __user pointer in qedf_dbg_stop_io_on_error_cmd_read() directly
	scsi: qedf: Do not touch __user pointer in qedf_dbg_fp_int_cmd_read() directly
	dma-buf/sync_file: Fix docs syntax
	IB/uverbs: Fix an potential error pointer dereference
	media: go7007: Remove redundant if statement
	USB: gadget: f_mass_storage: Fix unused variable warning
	media: i2c: ov2680: Set V4L2_CTRL_FLAG_MODIFY_LAYOUT on flips
	media: ov2680: Remove auto-gain and auto-exposure controls
	media: ov2680: Fix ov2680_bayer_order()
	media: ov2680: Fix vflip / hflip set functions
	media: ov2680: Fix regulators being left enabled on ov2680_power_on() errors
	cgroup:namespace: Remove unused cgroup_namespaces_init()
	scsi: core: Use 32-bit hostnum in scsi_host_lookup()
	scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock
	serial: tegra: handle clk prepare error in tegra_uart_hw_init()
	amba: bus: fix refcount leak
	Revert "IB/isert: Fix incorrect release of isert connection"
	HID: multitouch: Correct devm device reference for hidinput input_dev name
	rpmsg: glink: Add check for kstrdup
	arch: um: drivers: Kconfig: pedantic formatting
	um: Fix hostaudio build errors
	dmaengine: ste_dma40: Add missing IRQ check in d40_probe
	igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU
	netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c
	netfilter: xt_u32: validate user space input
	netfilter: xt_sctp: validate the flag_info count
	skbuff: skb_segment, Call zero copy functions before using skbuff frags
	igb: set max size RX buffer when store bad packet is enabled
	PM / devfreq: Fix leak in devfreq_dev_release()
	ALSA: pcm: Fix missing fixup call in compat hw_refine ioctl
	ipmi_si: fix a memleak in try_smi_init()
	ARM: OMAP2+: Fix -Warray-bounds warning in _pwrdm_state_switch()
	backlight/gpio_backlight: Compare against struct fb_info.device
	backlight/bd6107: Compare against struct fb_info.device
	backlight/lv5207lp: Compare against struct fb_info.device
	media: dvb: symbol fixup for dvb_attach()
	ntb: Drop packets when qp link is down
	ntb: Clean up tx tail index on link down
	ntb: Fix calculation ntb_transport_tx_free_entry()
	Revert "PCI: Mark NVIDIA T4 GPUs to avoid bus reset"
	procfs: block chmod on /proc/thread-self/comm
	parisc: Fix /proc/cpuinfo output for lscpu
	dlm: fix plock lookup when using multiple lockspaces
	dccp: Fix out of bounds access in DCCP error handler
	crypto: stm32 - fix loop iterating through scatterlist for DMA
	cpufreq: brcmstb-avs-cpufreq: Fix -Warray-bounds bug
	X.509: if signature is unsupported skip validation
	net: handle ARPHRD_PPP in dev_is_mac_header_xmit()
	pstore/ram: Check start of empty przs during init
	PCI/ATS: Add inline to pci_prg_resp_pasid_required()
	sc16is7xx: Set iobase to device index
	serial: sc16is7xx: fix broken port 0 uart init
	usb: typec: tcpci: clear the fault status bit
	udf: initialize newblock to 0
	scsi: qla2xxx: fix inconsistent TMF timeout
	scsi: qla2xxx: Turn off noisy message log
	fbdev/ep93xx-fb: Do not assign to struct fb_info.dev
	drm/ast: Fix DRAM init on AST2200
	parisc: led: Fix LAN receive and transmit LEDs
	parisc: led: Reduce CPU overhead for disk & lan LED computation
	clk: qcom: gcc-mdm9615: use proper parent for pll0_vote clock
	soc: qcom: qmi_encdec: Restrict string length in decode
	NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info
	kconfig: fix possible buffer overflow
	x86/virt: Drop unnecessary check on extended CPUID level in cpu_has_svm()
	watchdog: intel-mid_wdt: add MODULE_ALIAS() to allow auto-load
	pwm: lpc32xx: Remove handling of PWM channels
	net: read sk->sk_family once in sk_mc_loop()
	igb: disable virtualization features on 82580
	veth: Fixing transmit return status for dropped packets
	net: ipv6/addrconf: avoid integer underflow in ipv6_create_tempaddr
	af_unix: Fix data-races around user->unix_inflight.
	af_unix: Fix data-race around unix_tot_inflight.
	af_unix: Fix data-races around sk->sk_shutdown.
	af_unix: Fix data race around sk->sk_err.
	net: sched: sch_qfq: Fix UAF in qfq_dequeue()
	kcm: Destroy mutex in kcm_exit_net()
	igbvf: Change IGBVF_MIN to allow set rx/tx value between 64 and 80
	igb: Change IGB_MIN to allow set rx/tx value between 64 and 80
	idr: fix param name in idr_alloc_cyclic() doc
	netfilter: nfnetlink_osf: avoid OOB read
	ata: sata_gemini: Add missing MODULE_DESCRIPTION
	ata: pata_ftide010: Add missing MODULE_DESCRIPTION
	btrfs: don't start transaction when joining with TRANS_JOIN_NOSTART
	mtd: rawnand: brcmnand: Fix crash during the panic_write
	mtd: rawnand: brcmnand: Fix potential out-of-bounds access in oob write
	mtd: rawnand: brcmnand: Fix potential false time out warning
	perf hists browser: Fix hierarchy mode header
	net: ethernet: mtk_eth_soc: fix possible NULL pointer dereference in mtk_hwlro_get_fdir_all()
	kcm: Fix memory leak in error path of kcm_sendmsg()
	ixgbe: fix timestamp configuration code
	kcm: Fix error handling for SOCK_DGRAM in kcm_sendmsg().
	drm/amd/display: Fix a bug when searching for insert_above_mpcc
	parisc: Drop loops_per_jiffy from per_cpu struct
	autofs: fix memory leak of waitqueues in autofs_catatonic_mode
	btrfs: output extra debug info if we failed to find an inline backref
	ACPICA: Add AML_NO_OPERAND_RESOLVE flag to Timer
	ACPI: video: Add backlight=native DMI quirk for Lenovo Ideapad Z470
	hw_breakpoint: fix single-stepping when using bpf_overflow_handler
	wifi: ath9k: fix printk specifier
	wifi: mwifiex: fix fortify warning
	crypto: lib/mpi - avoid null pointer deref in mpi_cmp_ui()
	tpm_tis: Resend command to recover from data transfer errors
	alx: fix OOB-read compiler warning
	drm/exynos: fix a possible null-pointer dereference due to data race in exynos_drm_crtc_atomic_disable()
	md: raid1: fix potential OOB in raid1_remove_disk()
	ext2: fix datatype of block number in ext2_xattr_set2()
	fs/jfs: prevent double-free in dbUnmount() after failed jfs_remount()
	jfs: fix invalid free of JFS_IP(ipimap)->i_imap in diUnmount
	powerpc/pseries: fix possible memory leak in ibmebus_bus_init()
	media: dvb-usb-v2: af9035: Fix null-ptr-deref in af9035_i2c_master_xfer
	media: dw2102: Fix null-ptr-deref in dw2102_i2c_transfer()
	media: af9005: Fix null-ptr-deref in af9005_i2c_xfer
	media: anysee: fix null-ptr-deref in anysee_master_xfer
	media: az6007: Fix null-ptr-deref in az6007_i2c_xfer()
	iio: core: Use min() instead of min_t() to make code more robust
	media: tuners: qt1010: replace BUG_ON with a regular error
	media: pci: cx23885: replace BUG with error return
	usb: gadget: fsl_qe_udc: validate endpoint index for ch9 udc
	scsi: target: iscsi: Fix buffer overflow in lio_target_nacl_info_show()
	serial: cpm_uart: Avoid suspicious locking
	media: pci: ipu3-cio2: Initialise timing struct to avoid a compiler warning
	kobject: Add sanity check for kset->kobj.ktype in kset_register()
	md/raid1: fix error: ISO C90 forbids mixed declarations
	attr: block mode changes of symlinks
	btrfs: fix lockdep splat and potential deadlock after failure running delayed items
	nfsd: fix change_info in NFSv4 RENAME replies
	mtd: rawnand: brcmnand: Fix ECC level field setting for v7.2 controller
	net/sched: cls_fw: No longer copy tcf_result on update to avoid use-after-free
	net/sched: Retire rsvp classifier
	Linux 4.19.295

Change-Id: I5de88dc1e8cebe5736df3023205233cb40c4aa35
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-09-30 11:47:07 +00:00
Eric Dumazet
0f9ca28a3a tcp: tcp_enter_quickack_mode() should be static
[ Upstream commit 03b123debcbc8db987bda17ed8412cc011064c22 ]

After commit d2ccd7bc8a ("tcp: avoid resetting ACK timer in DCTCP"),
tcp_enter_quickack_mode() is only used from net/ipv4/tcp_input.c.

Fixes: d2ccd7bc8a ("tcp: avoid resetting ACK timer in DCTCP")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20230718162049.1444938-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23 10:48:00 +02:00
Greg Kroah-Hartman
813e482b1b Merge 4.19.291 into android-4.19-stable
Changes in 4.19.291
	gfs2: Don't deref jdesc in evict
	x86/smp: Use dedicated cache-line for mwait_play_dead()
	video: imsttfb: check for ioremap() failures
	fbdev: imsttfb: Fix use after free bug in imsttfb_probe
	drm/edid: Fix uninitialized variable in drm_cvt_modes()
	scripts/tags.sh: Resolve gtags empty index generation
	drm/amdgpu: Validate VM ioctl flags.
	treewide: Remove uninitialized_var() usage
	md/raid10: check slab-out-of-bounds in md_bitmap_get_counter
	md/raid10: fix overflow of md/safe_mode_delay
	md/raid10: fix wrong setting of max_corr_read_errors
	md/raid10: fix io loss while replacement replace rdev
	irqchip/jcore-aic: Kill use of irq_create_strict_mappings()
	irqchip/jcore-aic: Fix missing allocation of IRQ descriptors
	clocksource/drivers: Unify the names to timer-* format
	clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
	clocksource/drivers/cadence-ttc: Fix memory leak in ttc_timer_probe
	PM: domains: fix integer overflow issues in genpd_parse_state()
	ARM: 9303/1: kprobes: avoid missing-declaration warnings
	evm: Complete description of evm_inode_setattr()
	wifi: ath9k: fix AR9003 mac hardware hang check register offset calculation
	wifi: ath9k: avoid referencing uninit memory in ath9k_wmi_ctrl_rx
	samples/bpf: Fix buffer overflow in tcp_basertt
	wifi: mwifiex: Fix the size of a memory allocation in mwifiex_ret_802_11_scan()
	nfc: constify several pointers to u8, char and sk_buff
	nfc: llcp: fix possible use of uninitialized variable in nfc_llcp_send_connect()
	wifi: orinoco: Fix an error handling path in spectrum_cs_probe()
	wifi: orinoco: Fix an error handling path in orinoco_cs_probe()
	wifi: atmel: Fix an error handling path in atmel_probe()
	wl3501_cs: Fix a bunch of formatting issues related to function docs
	wl3501_cs: Remove unnecessary NULL check
	wl3501_cs: Fix misspelling and provide missing documentation
	net: create netdev->dev_addr assignment helpers
	wl3501_cs: use eth_hw_addr_set()
	wifi: wl3501_cs: Fix an error handling path in wl3501_probe()
	wifi: ray_cs: Utilize strnlen() in parse_addr()
	wifi: ray_cs: Drop useless status variable in parse_addr()
	wifi: ray_cs: Fix an error handling path in ray_probe()
	wifi: ath9k: don't allow to overwrite ENDPOINT0 attributes
	wifi: rsi: Do not set MMC_PM_KEEP_POWER in shutdown
	watchdog/perf: define dummy watchdog_update_hrtimer_threshold() on correct config
	watchdog/perf: more properly prevent false positives with turbo modes
	kexec: fix a memory leak in crash_shrink_memory()
	memstick r592: make memstick_debug_get_tpc_name() static
	wifi: ath9k: Fix possible stall on ath9k_txq_list_has_key()
	wifi: ath9k: convert msecs to jiffies where needed
	netlink: fix potential deadlock in netlink_set_err()
	netlink: do not hard code device address lenth in fdb dumps
	gtp: Fix use-after-free in __gtp_encap_destroy().
	lib/ts_bm: reset initial match offset for every block of text
	netfilter: nf_conntrack_sip: fix the ct_sip_parse_numerical_param() return value.
	ipvlan: Fix return value of ipvlan_queue_xmit()
	netlink: Add __sock_i_ino() for __netlink_diag_dump().
	radeon: avoid double free in ci_dpm_init()
	Input: drv260x - sleep between polling GO bit
	ARM: dts: BCM5301X: Drop "clock-names" from the SPI node
	Input: adxl34x - do not hardcode interrupt trigger type
	drm/panel: simple: fix active size for Ampire AM-480272H3TMQW-T01H
	ARM: ep93xx: fix missing-prototype warnings
	ASoC: es8316: Increment max value for ALC Capture Target Volume control
	soc/fsl/qe: fix usb.c build errors
	IB/hfi1: Fix sdma.h tx->num_descs off-by-one errors
	arm64: dts: renesas: ulcb-kf: Remove flow control for SCIF1
	fbdev: omapfb: lcd_mipid: Fix an error handling path in mipid_spi_probe()
	drm/radeon: fix possible division-by-zero errors
	ALSA: ac97: Fix possible NULL dereference in snd_ac97_mixer
	scsi: 3w-xxxx: Add error handling for initialization failure in tw_probe()
	PCI: Add pci_clear_master() stub for non-CONFIG_PCI
	pinctrl: cherryview: Return correct value if pin in push-pull mode
	perf dwarf-aux: Fix off-by-one in die_get_varname()
	pinctrl: at91-pio4: check return value of devm_kasprintf()
	hwrng: virtio - add an internal buffer
	hwrng: virtio - don't wait on cleanup
	hwrng: virtio - don't waste entropy
	hwrng: virtio - always add a pending request
	hwrng: virtio - Fix race on data_avail and actual data
	crypto: nx - fix build warnings when DEBUG_FS is not enabled
	modpost: fix section mismatch message for R_ARM_ABS32
	modpost: fix section mismatch message for R_ARM_{PC24,CALL,JUMP24}
	ARCv2: entry: comments about hardware auto-save on taken interrupts
	ARCv2: entry: push out the Z flag unclobber from common EXCEPTION_PROLOGUE
	ARCv2: entry: avoid a branch
	ARCv2: entry: rewrite to enable use of double load/stores LDD/STD
	ARC: define ASM_NL and __ALIGN(_STR) outside #ifdef __ASSEMBLY__ guard
	USB: serial: option: add LARA-R6 01B PIDs
	block: change all __u32 annotations to __be32 in affs_hardblocks.h
	w1: fix loop in w1_fini()
	sh: j2: Use ioremap() to translate device tree address into kernel memory
	media: usb: Check az6007_read() return value
	media: videodev2.h: Fix struct v4l2_input tuner index comment
	media: usb: siano: Fix warning due to null work_func_t function pointer
	extcon: Fix kernel doc of property fields to avoid warnings
	extcon: Fix kernel doc of property capability fields to avoid warnings
	usb: phy: phy-tahvo: fix memory leak in tahvo_usb_probe()
	mfd: rt5033: Drop rt5033-battery sub-device
	KVM: s390: fix KVM_S390_GET_CMMA_BITS for GFNs in memslot holes
	mfd: intel-lpss: Add missing check for platform_get_resource
	mfd: stmpe: Only disable the regulators if they are enabled
	rtc: st-lpc: Release some resources in st_rtc_probe() in case of error
	sctp: fix potential deadlock on &net->sctp.addr_wq_lock
	Add MODULE_FIRMWARE() for FIRMWARE_TG357766.
	spi: bcm-qspi: return error if neither hif_mspi nor mspi is available
	mailbox: ti-msgmgr: Fill non-message tx data fields with 0x0
	f2fs: fix error path handling in truncate_dnode()
	powerpc: allow PPC_EARLY_DEBUG_CPM only when SERIAL_CPM=y
	net: bridge: keep ports without IFF_UNICAST_FLT in BR_PROMISC mode
	tcp: annotate data races in __tcp_oow_rate_limited()
	net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX
	sh: dma: Fix DMA channel offset calculation
	i2c: xiic: Defer xiic_wakeup() and __xiic_start_xfer() in xiic_process()
	i2c: xiic: Don't try to handle more interrupt events after error
	ALSA: jack: Fix mutex call in snd_jack_report()
	NFSD: add encoding of op_recall flag for write delegation
	mmc: core: disable TRIM on Kingston EMMC04G-M627
	mmc: core: disable TRIM on Micron MTFC4GACAJCN-1M
	bcache: Remove unnecessary NULL point check in node allocations
	integrity: Fix possible multiple allocation in integrity_inode_get()
	jffs2: reduce stack usage in jffs2_build_xattr_subsystem()
	btrfs: fix race when deleting quota root from the dirty cow roots list
	ARM: orion5x: fix d2net gpio initialization
	spi: spi-fsl-spi: remove always-true conditional in fsl_spi_do_one_msg
	spi: spi-fsl-spi: relax message sanity checking a little
	spi: spi-fsl-spi: allow changing bits_per_word while CS is still active
	netfilter: nf_tables: fix nat hook table deletion
	netfilter: nf_tables: add rescheduling points during loop detection walks
	netfilter: nftables: add helper function to set the base sequence number
	netfilter: add helper function to set up the nfnetlink header and use it
	netfilter: nf_tables: use net_generic infra for transaction data
	netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE
	netfilter: nf_tables: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain
	netfilter: nf_tables: reject unbound anonymous set before commit phase
	netfilter: nf_tables: unbind non-anonymous set if rule construction fails
	netfilter: nf_tables: fix scheduling-while-atomic splat
	netfilter: conntrack: Avoid nf_ct_helper_hash uses after free
	netfilter: nf_tables: prevent OOB access in nft_byteorder_eval
	net: lan743x: Don't sleep in atomic context
	workqueue: clean up WORK_* constant types, clarify masking
	net: mvneta: fix txq_map in case of txq_number==1
	vrf: Increment Icmp6InMsgs on the original netdev
	icmp6: Fix null-ptr-deref of ip6_null_entry->rt6i_idev in icmp6_dev().
	udp6: fix udp6_ehashfn() typo
	ntb: idt: Fix error handling in idt_pci_driver_init()
	NTB: amd: Fix error handling in amd_ntb_pci_driver_init()
	ntb: intel: Fix error handling in intel_ntb_pci_driver_init()
	NTB: ntb_transport: fix possible memory leak while device_register() fails
	NTB: ntb_tool: Add check for devm_kcalloc
	ipv6/addrconf: fix a potential refcount underflow for idev
	wifi: airo: avoid uninitialized warning in airo_get_rate()
	net/sched: make psched_mtu() RTNL-less safe
	pinctrl: amd: Fix mistake in handling clearing pins at startup
	pinctrl: amd: Detect internal GPIO0 debounce handling
	pinctrl: amd: Only use special debounce behavior for GPIO 0
	tpm: tpm_vtpm_proxy: fix a race condition in /dev/vtpmx creation
	net: bcmgenet: Ensure MDIO unregistration has clocks enabled
	SUNRPC: Fix UAF in svc_tcp_listen_data_ready()
	perf intel-pt: Fix CYC timestamps after standalone CBR
	ext4: fix wrong unit use in ext4_mb_clear_bb
	ext4: only update i_reserved_data_blocks on successful block allocation
	jfs: jfs_dmap: Validate db_l2nbperpage while mounting
	PCI/PM: Avoid putting EloPOS E2/S2/H2 PCIe Ports in D3cold
	PCI: Add function 1 DMA alias quirk for Marvell 88SE9235
	PCI: qcom: Disable write access to read only registers for IP v2.3.3
	PCI: rockchip: Assert PCI Configuration Enable bit after probe
	PCI: rockchip: Write PCI Device ID to correct register
	PCI: rockchip: Add poll and timeout to wait for PHY PLLs to be locked
	PCI: rockchip: Fix legacy IRQ generation for RK3399 PCIe endpoint core
	PCI: rockchip: Use u32 variable to access 32-bit registers
	misc: pci_endpoint_test: Free IRQs before removing the device
	misc: pci_endpoint_test: Re-init completion for every test
	md/raid0: add discard support for the 'original' layout
	fs: dlm: return positive pid value for F_GETLK
	serial: atmel: don't enable IRQs prematurely
	hwrng: imx-rngc - fix the timeout for init and self check
	ceph: don't let check_caps skip sending responses for revoke msgs
	meson saradc: fix clock divider mask length
	Revert "8250: add support for ASIX devices with a FIFO bug"
	tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() in case of error
	tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() when iterating clk
	ring-buffer: Fix deadloop issue on reading trace_pipe
	xtensa: ISS: fix call to split_if_spec
	scsi: qla2xxx: Wait for io return on terminate rport
	scsi: qla2xxx: Fix potential NULL pointer dereference
	scsi: qla2xxx: Check valid rport returned by fc_bsg_to_rport()
	scsi: qla2xxx: Pointer may be dereferenced
	drm/atomic: Fix potential use-after-free in nonblocking commits
	tracing/histograms: Add histograms to hist_vars if they have referenced variables
	perf probe: Add test for regression introduced by switch to die_get_decl_file()
	fuse: revalidate: don't invalidate if interrupted
	can: bcm: Fix UAF in bcm_proc_show()
	ext4: correct inline offset when handling xattrs in inode body
	debugobjects: Recheck debug_objects_enabled before reporting
	nbd: Add the maximum limit of allocated index in nbd_dev_add
	md: fix data corruption for raid456 when reshape restart while grow up
	md/raid10: prevent soft lockup while flush writes
	posix-timers: Ensure timer ID search-loop limit is valid
	sched/fair: Don't balance task to its current running CPU
	bpf: Address KCSAN report on bpf_lru_list
	wifi: wext-core: Fix -Wstringop-overflow warning in ioctl_standard_iw_point()
	wifi: iwlwifi: mvm: avoid baid size integer overflow
	igb: Fix igb_down hung on surprise removal
	spi: bcm63xx: fix max prepend length
	fbdev: imxfb: warn about invalid left/right margin
	pinctrl: amd: Use amd_pinconf_set() for all config options
	net: ethernet: ti: cpsw_ale: Fix cpsw_ale_get_field()/cpsw_ale_set_field()
	net:ipv6: check return value of pskb_trim()
	Revert "tcp: avoid the lookup process failing to get sk in ehash table"
	fbdev: au1200fb: Fix missing IRQ check in au1200fb_drv_probe
	llc: Don't drop packet from non-root netns.
	netfilter: nf_tables: fix spurious set element insertion failure
	netfilter: nf_tables: can't schedule in nft_chain_validate
	net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX
	tcp: annotate data-races around tp->linger2
	tcp: annotate data-races around rskq_defer_accept
	tcp: annotate data-races around tp->notsent_lowat
	tcp: annotate data-races around fastopenq.max_qlen
	tracing/histograms: Return an error if we fail to add histogram to hist_vars list
	gpio: tps68470: Make tps68470_gpio_output() always set the initial value
	bcache: use MAX_CACHES_PER_SET instead of magic number 8 in __bch_bucket_alloc_set
	bcache: remove 'int n' from parameter list of bch_bucket_alloc_set()
	bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent
	btrfs: fix extent buffer leak after tree mod log failure at split_node()
	ext4: rename journal_dev to s_journal_dev inside ext4_sb_info
	ext4: Fix reusing stale buffer heads from last failed mounting
	PCI: Rework pcie_retrain_link() wait loop
	PCI/ASPM: Return 0 or -ETIMEDOUT from pcie_retrain_link()
	PCI/ASPM: Factor out pcie_wait_for_retrain()
	PCI/ASPM: Avoid link retraining race
	dlm: cleanup plock_op vs plock_xop
	dlm: rearrange async condition return
	fs: dlm: interrupt posix locks only when process is killed
	ftrace: Add information on number of page groups allocated
	ftrace: Check if pages were allocated before calling free_pages()
	ftrace: Store the order of pages allocated in ftrace_page
	ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()
	scsi: qla2xxx: Fix inconsistent format argument type in qla_os.c
	scsi: qla2xxx: Array index may go out of bound
	ext4: fix to check return value of freeze_bdev() in ext4_shutdown()
	i40e: Fix an NULL vs IS_ERR() bug for debugfs_create_dir()
	phy: hisilicon: Fix an out of bounds check in hisi_inno_phy_probe()
	ethernet: atheros: fix return value check in atl1e_tso_csum()
	ipv6 addrconf: fix bug where deleting a mngtmpaddr can create a new temporary address
	tcp: Reduce chance of collisions in inet6_hashfn().
	bonding: reset bond's flags when down link is P2P device
	team: reset team's flags when down link is P2P device
	platform/x86: msi-laptop: Fix rfkill out-of-sync on MSI Wind U100
	net/sched: mqprio: refactor nlattr parsing to a separate function
	net/sched: mqprio: add extack to mqprio_parse_nlattr()
	net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64
	benet: fix return value check in be_lancer_xmit_workarounds()
	RDMA/mlx4: Make check for invalid flags stricter
	drm/msm: Fix IS_ERR_OR_NULL() vs NULL check in a5xx_submit_in_rb()
	ASoC: fsl_spdif: Silence output on stop
	block: Fix a source code comment in include/uapi/linux/blkzoned.h
	dm raid: fix missing reconfig_mutex unlock in raid_ctr() error paths
	ata: pata_ns87415: mark ns87560_tf_read static
	ring-buffer: Fix wrong stat of cpu_buffer->read
	tracing: Fix warning in trace_buffered_event_disable()
	USB: serial: option: support Quectel EM060K_128
	USB: serial: option: add Quectel EC200A module support
	USB: serial: simple: add Kaufmann RKS+CAN VCP
	USB: serial: simple: sort driver entries
	can: gs_usb: gs_can_close(): add missing set of CAN state to CAN_STATE_STOPPED
	Revert "usb: dwc3: core: Enable AutoRetry feature in the controller"
	usb: dwc3: pci: skip BYT GPIO lookup table for hardwired phy
	usb: dwc3: don't reset device side if dwc3 was configured as host-only
	usb: ohci-at91: Fix the unhandle interrupt when resume
	USB: quirks: add quirk for Focusrite Scarlett
	usb: xhci-mtk: set the dma max_seg_size
	Documentation: security-bugs.rst: update preferences when dealing with the linux-distros group
	Documentation: security-bugs.rst: clarify CVE handling
	staging: ks7010: potential buffer overflow in ks_wlan_set_encode_ext()
	hwmon: (nct7802) Fix for temp6 (PECI1) processed even if PECI1 disabled
	btrfs: check for commit error at btrfs_attach_transaction_barrier()
	tpm_tis: Explicitly check for error code
	irq-bcm6345-l1: Do not assume a fixed block to cpu mapping
	serial: 8250_dw: split Synopsys DesignWare 8250 common functions
	serial: 8250_dw: Preserve original value of DLF register
	virtio-net: fix race between set queues and probe
	s390/dasd: fix hanging device after quiesce/resume
	ASoC: wm8904: Fill the cache for WM8904_ADC_TEST_0 register
	dm cache policy smq: ensure IO doesn't prevent cleaner policy progress
	drm/client: Fix memory leak in drm_client_target_cloned
	net/sched: cls_fw: Fix improper refcount update leads to use-after-free
	net/sched: sch_qfq: account for stab overhead in qfq_enqueue
	ASoC: cs42l51: fix driver to properly autoload with automatic module loading
	net/sched: cls_u32: Fix reference counter leak leading to overflow
	perf: Fix function pointer case
	loop: Select I/O scheduler 'none' from inside add_disk()
	word-at-a-time: use the same return type for has_zero regardless of endianness
	KVM: s390: fix sthyi error handling
	net/mlx5e: fix return value check in mlx5e_ipsec_remove_trailer()
	perf test uprobe_from_different_cu: Skip if there is no gcc
	net: sched: cls_u32: Fix match key mis-addressing
	net: add missing data-race annotations around sk->sk_peek_off
	net: add missing data-race annotation for sk_ll_usec
	net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free
	net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free
	ip6mr: Fix skb_under_panic in ip6mr_cache_report()
	tcp_metrics: fix addr_same() helper
	tcp_metrics: annotate data-races around tm->tcpm_stamp
	tcp_metrics: annotate data-races around tm->tcpm_lock
	tcp_metrics: annotate data-races around tm->tcpm_vals[]
	tcp_metrics: annotate data-races around tm->tcpm_net
	tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen
	scsi: zfcp: Defer fc_rport blocking until after ADISC response
	libceph: fix potential hang in ceph_osdc_notify()
	USB: zaurus: Add ID for A-300/B-500/C-700
	fs/sysv: Null check to prevent null-ptr-deref bug
	Bluetooth: L2CAP: Fix use-after-free in l2cap_sock_ready_cb
	net: usbnet: Fix WARNING in usbnet_start_xmit/usb_submit_urb
	ext2: Drop fragment support
	test_firmware: fix a memory leak with reqs buffer
	test_firmware: return ENOMEM instead of ENOSPC on failed memory allocation
	mtd: rawnand: omap_elm: Fix incorrect type in assignment
	powerpc/mm/altmap: Fix altmap boundary check
	PM / wakeirq: support enabling wake-up irq after runtime_suspend called
	PM: sleep: wakeirq: fix wake irq arming
	ARM: dts: imx6sll: Make ssi node name same as other platforms
	ARM: dts: imx: add usb alias
	ARM: dts: imx6sll: fixup of operating points
	ARM: dts: nxp/imx6sll: fix wrong property name in usbphy node
	drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions
	arm64: dts: stratix10: fix incorrect I2C property for SCL signal
	drm/edid: fix objtool warning in drm_cvt_modes()
	Linux 4.19.291

Change-Id: I4f78e25efd18415989ecf5e227a17e05b0d6386c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-08-25 11:24:56 +00:00
Eric Dumazet
73e2b8a15e tcp: annotate data-races around tp->notsent_lowat
[ Upstream commit 1aeb87bc1440c5447a7fa2d6e3c2cca52cbd206b ]

tp->notsent_lowat can be read locklessly from do_tcp_getsockopt()
and tcp_poll().

Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230719212857.3943972-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11 11:45:27 +02:00
Cambda Zhu
457202c473 net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX
[ Upstream commit f0628c524fd188c3f9418e12478dfdfadacba815 ]

This patch changes the behavior of TCP_LINGER2 about its limit. The
sysctl_tcp_fin_timeout used to be the limit of TCP_LINGER2 but now it's
only the default value. A new macro named TCP_FIN_TIMEOUT_MAX is added
as the limit of TCP_LINGER2, which is 2 minutes.

Since TCP_LINGER2 used sysctl_tcp_fin_timeout as the default value
and the limit in the past, the system administrator cannot set the
default value for most of sockets and let some sockets have a greater
timeout. It might be a mistake that let the sysctl to be the limit of
the TCP_LINGER2. Maybe we can add a new sysctl to set the max of
TCP_LINGER2, but FIN-WAIT-2 timeout is usually no need to be too long
and 2 minutes are legal considering TCP specs.

Changes in v3:
- Remove the new socket option and change the TCP_LINGER2 behavior so
  that the timeout can be set to value between sysctl_tcp_fin_timeout
  and 2 minutes.

Changes in v2:
- Add int overflow check for the new socket option.

Changes in v1:
- Add a new socket option to set timeout greater than
  sysctl_tcp_fin_timeout.

Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 9df5335ca974 ("tcp: annotate data-races around tp->linger2")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11 11:45:27 +02:00
Greg Kroah-Hartman
2d7501f8af Revert "tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT"
This reverts commit 0d70e638ab.  It is
part of a series that breaks the kernel abi for Android.  For this
kernel tree, we do not care so much about KASAN, so it's not a big issue
to revert.  If it is needed in the future, it can be brought back in an
abi-safe way.

Bug: 161946584
Change-Id: Ie4c15028abda17a6ea6e023fb046c7b2e3ca6827
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-06-13 07:30:40 +00:00
Greg Kroah-Hartman
9b73585f8e Revert "tcp: factor out __tcp_close() helper"
This reverts commit e62806034c.  It is
part of a series that breaks the kernel abi for Android.  For this
kernel tree, we do not care so much about KASAN, so it's not a big issue
to revert.  If it is needed in the future, it can be brought back in an
abi-safe way.

Bug: 161946584
Change-Id: If4e4f69a51b196c752fcaa21bf7d27c971248e08
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-06-13 07:30:40 +00:00
Greg Kroah-Hartman
4e2cad2c2a Merge 4.19.284 into android-4.19-stable
Changes in 4.19.284
	net: Fix load-tearing on sk->sk_stamp in sock_recv_cmsgs().
	netlink: annotate accesses to nlk->cb_running
	net: annotate sk->sk_err write from do_recvmmsg()
	tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT
	tcp: return EPOLLOUT from tcp_poll only when notsent_bytes is half the limit
	tcp: factor out __tcp_close() helper
	tcp: add annotations around sk->sk_shutdown accesses
	ipvlan:Fix out-of-bounds caused by unclear skb->cb
	net: datagram: fix data-races in datagram_poll()
	af_unix: Fix a data race of sk->sk_receive_queue->qlen.
	af_unix: Fix data races around sk->sk_shutdown.
	fs: hfsplus: remove WARN_ON() from hfsplus_cat_{read,write}_inode()
	drm/amd/display: Use DC_LOG_DC in the trasform pixel function
	regmap: cache: Return error in cache sync operations for REGCACHE_NONE
	memstick: r592: Fix UAF bug in r592_remove due to race condition
	firmware: arm_sdei: Fix sleep from invalid context BUG
	ACPI: EC: Fix oops when removing custom query handlers
	drm/tegra: Avoid potential 32-bit integer overflow
	ACPICA: Avoid undefined behavior: applying zero offset to null pointer
	ACPICA: ACPICA: check null return of ACPI_ALLOCATE_ZEROED in acpi_db_display_objects
	wifi: brcmfmac: cfg80211: Pass the PMK in binary instead of hex
	ext2: Check block size validity during mount
	net: pasemi: Fix return type of pasemi_mac_start_tx()
	net: Catch invalid index in XPS mapping
	lib: cpu_rmap: Avoid use after free on rmap->obj array entries
	scsi: message: mptlan: Fix use after free bug in mptlan_remove() due to race condition
	gfs2: Fix inode height consistency check
	ext4: set goal start correctly in ext4_mb_normalize_request
	ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()
	f2fs: fix to drop all dirty pages during umount() if cp_error is set
	wifi: iwlwifi: dvm: Fix memcpy: detected field-spanning write backtrace
	Bluetooth: L2CAP: fix "bad unlock balance" in l2cap_disconnect_rsp
	staging: rtl8192e: Replace macro RTL_PCI_DEVICE with PCI_DEVICE
	HID: logitech-hidpp: Don't use the USB serial for USB devices
	HID: logitech-hidpp: Reconcile USB and Unifying serials
	spi: spi-imx: fix MX51_ECSPI_* macros when cs > 3
	HID: wacom: generic: Set battery quirk only when we see battery data
	usb: typec: tcpm: fix multiple times discover svids error
	serial: 8250: Reinit port->pm on port specific driver unbind
	mcb-pci: Reallocate memory region to avoid memory overlapping
	sched: Fix KCSAN noinstr violation
	recordmcount: Fix memory leaks in the uwrite function
	clk: tegra20: fix gcc-7 constant overflow warning
	Input: xpad - add constants for GIP interface numbers
	phy: st: miphy28lp: use _poll_timeout functions for waits
	mfd: dln2: Fix memory leak in dln2_probe()
	btrfs: replace calls to btrfs_find_free_ino with btrfs_find_free_objectid
	btrfs: fix space cache inconsistency after error loading it from disk
	cpupower: Make TSC read per CPU for Mperf monitor
	af_key: Reject optional tunnel/BEET mode templates in outbound policies
	net: fec: Better handle pm_runtime_get() failing in .remove()
	vsock: avoid to close connected socket after the timeout
	drivers: provide devm_platform_ioremap_resource()
	serial: arc_uart: fix of_iomap leak in `arc_serial_probe`
	ip6_gre: Fix skb_under_panic in __gre6_xmit()
	ip6_gre: Make o_seqno start from 0 in native mode
	ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode
	erspan: get the proto with the md version for collect_md
	media: netup_unidvb: fix use-after-free at del_timer()
	drm/exynos: fix g2d_open/close helper function definitions
	net: nsh: Use correct mac_offset to unwind gso skb in nsh_gso_segment()
	net: bcmgenet: Remove phy_stop() from bcmgenet_netif_stop()
	net: bcmgenet: Restore phy_stop() depending upon suspend/close
	cassini: Fix a memory leak in the error handling path of cas_init_one()
	igb: fix bit_shift to be in [1..8] range
	vlan: fix a potential uninit-value in vlan_dev_hard_start_xmit()
	usb-storage: fix deadlock when a scsi command timeouts more than once
	usb: typec: altmodes/displayport: fix pin_assignment_show
	ALSA: hda: Fix Oops by 9.1 surround channel names
	ALSA: hda: Add NVIDIA codec IDs a3 through a7 to patch table
	statfs: enforce statfs[64] structure initialization
	serial: Add support for Advantech PCI-1611U card
	ceph: force updating the msg pointer in non-split case
	tpm/tpm_tis: Disable interrupts for more Lenovo devices
	nilfs2: fix use-after-free bug of nilfs_root in nilfs_evict_inode()
	netfilter: nftables: add nft_parse_register_load() and use it
	netfilter: nftables: add nft_parse_register_store() and use it
	netfilter: nftables: statify nft_parse_register()
	netfilter: nf_tables: validate registers coming from userspace.
	netfilter: nf_tables: add nft_setelem_parse_key()
	netfilter: nf_tables: allow up to 64 bytes in the set element data area
	netfilter: nf_tables: stricter validation of element data
	netfilter: nf_tables: validate NFTA_SET_ELEM_OBJREF based on NFT_SET_OBJECT flag
	netfilter: nf_tables: do not allow RULE_ID to refer to another chain
	HID: wacom: Force pen out of prox if no events have been received in a while
	Add Acer Aspire Ethos 8951G model quirk
	ALSA: hda/realtek - More constifications
	ALSA: hda/realtek - Add Headset Mic supported for HP cPC
	ALSA: hda/realtek - Enable headset mic of Acer X2660G with ALC662
	ALSA: hda/realtek - Enable the headset of Acer N50-600 with ALC662
	ALSA: hda/realtek - The front Mic on a HP machine doesn't work
	ALSA: hda/realtek: Fix the mic type detection issue for ASUS G551JW
	ALSA: hda/realtek - Add headset Mic support for Lenovo ALC897 platform
	ALSA: hda/realtek - ALC897 headset MIC no sound
	ALSA: hda/realtek: Add a quirk for HP EliteDesk 805
	lib/string_helpers: Introduce string_upper() and string_lower() helpers
	usb: gadget: u_ether: Convert prints to device prints
	usb: gadget: u_ether: Fix host MAC address case
	vc_screen: rewrite vcs_size to accept vc, not inode
	vc_screen: reload load of struct vc_data pointer in vcs_write() to avoid UAF
	s390/qdio: get rid of register asm
	s390/qdio: fix do_sqbs() inline assembly constraint
	spi: spi-fsl-spi: automatically adapt bits-per-word in cpu mode
	spi: fsl-spi: Re-organise transfer bits_per_word adaptation
	spi: fsl-cpm: Use 16 bit mode for large transfers with even size
	ALSA: hda/ca0132: add quirk for EVGA X299 DARK
	m68k: Move signal frame following exception on 68020/030
	parisc: Allow to reboot machine after system halt
	btrfs: use nofs when cleaning up aborted transactions
	x86/mm: Avoid incomplete Global INVLPG flushes
	selftests/memfd: Fix unknown type name build failure
	parisc: Fix flush_dcache_page() for usage from irq context
	ALSA: hda/realtek - Fixed one of HP ALC671 platform Headset Mic supported
	ALSA: hda/realtek - Fix inverted bass GPIO pin on Acer 8951G
	udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated().
	USB: core: Add routines for endpoint checks in old drivers
	USB: sisusbvga: Add endpoint checks
	media: radio-shark: Add endpoint checks
	net: fix skb leak in __skb_tstamp_tx()
	bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields
	ipv6: Fix out-of-bounds access in ipv6_find_tlv()
	power: supply: leds: Fix blink to LED on transition
	power: supply: bq27xxx: Fix bq27xxx_battery_update() race condition
	power: supply: bq27xxx: Fix I2C IRQ race on remove
	power: supply: bq27xxx: Fix poll_interval handling and races on remove
	power: supply: sbs-charger: Fix INHIBITED bit for Status reg
	coresight: Fix signedness bug in tmc_etr_buf_insert_barrier_packet()
	xen/pvcalls-back: fix double frees with pvcalls_new_active_socket()
	x86/show_trace_log_lvl: Ensure stack pointer is aligned, again
	ASoC: Intel: Skylake: Fix declaration of enum skl_ch_cfg
	forcedeth: Fix an error handling path in nv_probe()
	3c589_cs: Fix an error handling path in tc589_probe()
	drivers: depend on HAS_IOMEM for devm_platform_ioremap_resource()
	Linux 4.19.284

Change-Id: I88843be551e748e295ea608158a2db7ab4486a65
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-06-08 11:16:01 +00:00
Paolo Abeni
e62806034c tcp: factor out __tcp_close() helper
[ Upstream commit 77c3c95637526f1e4330cc9a4b2065f668c2c4fe ]

unlocked version of protocol level close, will be used by
MPTCP to allow decouple orphaning and subflow level close.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: e14cadfd80d7 ("tcp: add annotations around sk->sk_shutdown accesses")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-30 12:42:07 +01:00
Eric Dumazet
0d70e638ab tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT
[ Upstream commit a74f0fa082b76c6a76cba5672f36218518bfdc09 ]

TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
as a step to enable bigger tcp sndbuf limits.

It works reasonably well, but the following happens :

Once the limit is reached, TCP stack generates
an [E]POLLOUT event for every incoming ACK packet.

This causes a high number of context switches.

This patch implements the strategy David Miller added
in sock_def_write_space() :

 - If TCP socket has a notsent_lowat constraint of X bytes,
   allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
   only if number of notsent bytes is below X/2

This considerably reduces TCP_NOTSENT_LOWAT overhead,
while allowing to keep the pipe full.

Tested:
 100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM

A:/# cat /proc/sys/net/ipv4/tcp_wmem
4096	262144	64000000
A:/# super_netperf 100 -H B -l 1000 -- -K bbr &

A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/

A:/# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 256220672  13532 694976    0    0    10     0   28   14  0  1 99  0  0
 2  0      0 256320016  13532 698480    0    0   512     0 715901 5927  0 10 90  0  0
 0  0      0 256197232  13532 700992    0    0   735    13 771161 5849  0 11 89  0  0
 1  0      0 256233824  13532 703320    0    0   512    23 719650 6635  0 11 89  0  0
 2  0      0 256226880  13532 705780    0    0   642     4 775650 6009  0 12 88  0  0

A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat

A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow

A:/# vmstat 5 5  # check that context switches have not inflated too much.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 260386512  13592 662148    0    0    10     0   17   14  0  1 99  0  0
 0  0      0 260519680  13592 604184    0    0   512    13 726843 12424  0 10 90  0  0
 1  1      0 260435424  13592 598360    0    0   512    25 764645 12925  0 10 90  0  0
 1  0      0 260855392  13592 578380    0    0   512     7 722943 13624  0 11 88  0  0
 1  0      0 260445008  13592 601176    0    0   614    34 772288 14317  0 10 90  0  0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: e14cadfd80d7 ("tcp: add annotations around sk->sk_shutdown accesses")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-30 12:42:07 +01:00
Greg Kroah-Hartman
c2ad6ebaaa Revert "tcp/udp: Make early_demux back namespacified."
This reverts commit 7162f05f1f which is
commit 11052589cf5c0bab3b4884d423d5f60c38fcf25d upstream.

It is breaks the abi and is not needed in Android systems, so revert it.

Bug: 161946584
Change-Id: I357d52cd6a635050f305127246f95cb2302633be
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2022-11-12 17:08:48 +00:00
Kuniyuki Iwashima
7162f05f1f tcp/udp: Make early_demux back namespacified.
commit 11052589cf5c0bab3b4884d423d5f60c38fcf25d upstream.

Commit e21145a987 ("ipv4: namespacify ip_early_demux sysctl knob") made
it possible to enable/disable early_demux on a per-netns basis.  Then, we
introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
TCP/UDP in commit dddb64bcb3 ("net: Add sysctl to toggle early demux for
tcp and udp").  However, the .proc_handler() was wrong and actually
disabled us from changing the behaviour in each netns.

We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
.early_demux() handler is not NULL.  When we toggle (tcp|udp)_early_demux,
the change itself is saved in each netns variable, but the .early_demux()
handler is a global variable, so the handler is switched based on the
init_net's sysctl variable.  Thus, netns (tcp|udp)_early_demux knobs have
nothing to do with the logic.  Whether we CAN execute proto .early_demux()
is always decided by init_net's sysctl knob, and whether we DO it or not is
by each netns ip_early_demux knob.

This patch namespacifies (tcp|udp)_early_demux again.  For now, the users
of the .early_demux() handler are TCP and UDP only, and they are called
directly to avoid retpoline.  So, we can remove the .early_demux() handler
from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
If another proto needs .early_demux(), we can restore it at that time.

Fixes: dddb64bcb3 ("net: Add sysctl to toggle early demux for tcp and udp")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-10 17:46:54 +01:00
Neal Cardwell
a434d10e7a tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limited
[ Upstream commit f4ce91ce12a7c6ead19b128ffa8cff6e3ded2a14 ]

This commit fixes a bug in the tracking of max_packets_out and
is_cwnd_limited. This bug can cause the connection to fail to remember
that is_cwnd_limited is true, causing the connection to fail to grow
cwnd when it should, causing throughput to be lower than it should be.

The following event sequence is an example that triggers the bug:

 (a) The connection is cwnd_limited, but packets_out is not at its
     peak due to TSO deferral deciding not to send another skb yet.
     In such cases the connection can advance max_packets_seq and set
     tp->is_cwnd_limited to true and max_packets_out to a small
     number.

(b) Then later in the round trip the connection is pacing-limited (not
     cwnd-limited), and packets_out is larger. In such cases the
     connection would raise max_packets_out to a bigger number but
     (unexpectedly) flip tp->is_cwnd_limited from true to false.

This commit fixes that bug.

One straightforward fix would be to separately track (a) the next
window after max_packets_out reaches a maximum, and (b) the next
window after tp->is_cwnd_limited is set to true. But this would
require consuming an extra u32 sequence number.

Instead, to save space we track only the most important
information. Specifically, we track the strongest available signal of
the degree to which the cwnd is fully utilized:

(1) If the connection is cwnd-limited then we remember that fact for
the current window.

(2) If the connection not cwnd-limited then we track the maximum
number of outstanding packets in the current window.

In particular, note that the new logic cannot trigger the buggy
(a)/(b) sequence above because with the new logic a condition where
tp->packets_out > tp->max_packets_out can only trigger an update of
tp->is_cwnd_limited if tp->is_cwnd_limited is false.

This first showed up in a testing of a BBRv2 dev branch, but this
buggy behavior highlighted a general issue with the
tcp_cwnd_validate() logic that can cause cwnd to fail to increase at
the proper rate for any TCP congestion control, including Reno or
CUBIC.

Fixes: ca8a226343 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-10-26 13:19:26 +02:00
Kuniyuki Iwashima
3e39da4423 tcp: Fix a data-race around sysctl_tcp_adv_win_scale.
commit 36eeee75ef0157e42fb6593dcc65daab289b559e upstream.

While reading sysctl_tcp_adv_win_scale, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-11 12:48:38 +02:00
Kuniyuki Iwashima
3b26e11b07 tcp: Fix data-races around sysctl_tcp_slow_start_after_idle.
[ Upstream commit 4845b5713ab18a1bb6e31d1fbb4d600240b8b691 ]

While reading sysctl_tcp_slow_start_after_idle, it can be changed
concurrently.  Thus, we need to add READ_ONCE() to its readers.

Fixes: 35089bb203 ("[TCP]: Add tcp_slow_start_after_idle sysctl.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-29 17:10:34 +02:00
Kuniyuki Iwashima
0f75343584 tcp: Fix a data-race around sysctl_tcp_notsent_lowat.
[ Upstream commit 55be873695ed8912eb77ff46d1d1cadf028bd0f3 ]

While reading sysctl_tcp_notsent_lowat, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-29 17:10:33 +02:00
Kuniyuki Iwashima
f197442a0e tcp: Fix data-races around some timeout sysctl knobs.
[ Upstream commit 39e24435a776e9de5c6dd188836cf2523547804b ]

While reading these sysctl knobs, they can be changed concurrently.
Thus, we need to add READ_ONCE() to their readers.

  - tcp_retries1
  - tcp_retries2
  - tcp_orphan_retries
  - tcp_fin_timeout

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-29 17:10:33 +02:00
Eric Dumazet
6c2176f5ad tcp: make sure treq->af_specific is initialized
commit ba5a4fdd63ae0c575707030db0b634b160baddd7 upstream.

syzbot complained about a recent change in TCP stack,
hitting a NULL pointer [1]

tcp request sockets have an af_specific pointer, which
was used before the blamed change only for SYNACK generation
in non SYNCOOKIE mode.

tcp requests sockets momentarily created when third packet
coming from client in SYNCOOKIE mode were not using
treq->af_specific.

Make sure this field is populated, in the same way normal
TCP requests sockets do in tcp_conn_request().

[1]
TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending cookies.  Check SNMP counters.
general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 1 PID: 3695 Comm: syz-executor864 Not tainted 5.18.0-rc3-syzkaller-00224-g5fd1fe4807f9 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:tcp_create_openreq_child+0xe16/0x16b0 net/ipv4/tcp_minisocks.c:534
Code: 48 c1 ea 03 80 3c 02 00 0f 85 e5 07 00 00 4c 8b b3 28 01 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7e 08 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 c9 07 00 00 48 8b 3c 24 48 89 de 41 ff 56 08 48
RSP: 0018:ffffc90000de0588 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff888076490330 RCX: 0000000000000100
RDX: 0000000000000001 RSI: ffffffff87d67ff0 RDI: 0000000000000008
RBP: ffff88806ee1c7f8 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff87d67f00 R11: 0000000000000000 R12: ffff88806ee1bfc0
R13: ffff88801b0e0368 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f517fe58700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffcead76960 CR3: 000000006f97b000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 tcp_v6_syn_recv_sock+0x199/0x23b0 net/ipv6/tcp_ipv6.c:1267
 tcp_get_cookie_sock+0xc9/0x850 net/ipv4/syncookies.c:207
 cookie_v6_check+0x15c3/0x2340 net/ipv6/syncookies.c:258
 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1131 [inline]
 tcp_v6_do_rcv+0x1148/0x13b0 net/ipv6/tcp_ipv6.c:1486
 tcp_v6_rcv+0x3305/0x3840 net/ipv6/tcp_ipv6.c:1725
 ip6_protocol_deliver_rcu+0x2e9/0x1900 net/ipv6/ip6_input.c:422
 ip6_input_finish+0x14c/0x2c0 net/ipv6/ip6_input.c:464
 NF_HOOK include/linux/netfilter.h:307 [inline]
 NF_HOOK include/linux/netfilter.h:301 [inline]
 ip6_input+0x9c/0xd0 net/ipv6/ip6_input.c:473
 dst_input include/net/dst.h:461 [inline]
 ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
 NF_HOOK include/linux/netfilter.h:307 [inline]
 NF_HOOK include/linux/netfilter.h:301 [inline]
 ipv6_rcv+0x27f/0x3b0 net/ipv6/ip6_input.c:297
 __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5405
 __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5519
 process_backlog+0x3a0/0x7c0 net/core/dev.c:5847
 __napi_poll+0xb3/0x6e0 net/core/dev.c:6413
 napi_poll net/core/dev.c:6480 [inline]
 net_rx_action+0x8ec/0xc60 net/core/dev.c:6567
 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
 invoke_softirq kernel/softirq.c:432 [inline]
 __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
 irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097

Fixes: 5b0b9e4c2c89 ("tcp: md5: incorrect tcp_header_len for incoming connections")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[fruggeri: Account for backport conflicts from 35b2c3211609 and 6fc8c827dd4f]
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-12 12:20:25 +02:00
Eric Dumazet
cc639aa3c2 tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
[ Upstream commit 4bfe744ff1644fbc0a991a2677dc874475dd6776 ]

I had this bug sitting for too long in my pile, it is time to fix it.

Thanks to Doug Porter for reminding me of it!

We had various attempts in the past, including commit
0cbe6a8f089e ("tcp: remove SOCK_QUEUE_SHRUNK"),
but the issue is that TCP stack currently only generates
EPOLLOUT from input path, when tp->snd_una has advanced
and skb(s) cleaned from rtx queue.

If a flow has a big RTT, and/or receives SACKs, it is possible
that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
and no more data can be sent until tp->snd_una finally advances.

What is needed is to also check if POLLOUT needs to be generated
whenever tp->snd_nxt is advanced, from output path.

This bug triggers more often after an idle period, as
we do not receive ACK for at least one RTT. tcp_notsent_lowat
could be a fraction of what CWND and pacing rate would allow to
send during this RTT.

In a followup patch, I will remove the bogus call
to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
from tcp_check_space(). Fact that we have decided to generate
an EPOLLOUT does not mean the application has immediately
refilled the transmit queue. This optimistic call
might have been the reason the bug seemed not too serious.

Tested:

200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]

$ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
$ cat bench_rr.sh
SUM=0
for i in {1..10}
do
 V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
 echo $V
 SUM=$(($SUM + $V))
done
echo SUM=$SUM

Before patch:
$ bench_rr.sh
130000000
80000000
140000000
140000000
140000000
140000000
130000000
40000000
90000000
110000000
SUM=1140000000

After patch:
$ bench_rr.sh
430000000
590000000
530000000
450000000
450000000
350000000
450000000
490000000
480000000
460000000
SUM=4680000000  # This is 410 % of the value before patch.

Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Doug Porter <dsp@fb.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-05-12 12:20:21 +02:00
Eric Dumazet
92ba49b27e tcp: annotate tp->write_seq lockless reads
[ Upstream commit 0f31746452e6793ad6271337438af8f4defb8940 ]

There are few places where we fetch tp->write_seq while
this field can change from IRQ or other cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17 16:43:43 +01:00
Eric Dumazet
777d796966 tcp: fix SO_RCVLOWAT related hangs under mem pressure
[ Upstream commit f969dc5a885736842c3511ecdea240fbb02d25d9 ]

While commit 24adbc1676af ("tcp: fix SO_RCVLOWAT hangs with fat skbs")
fixed an issue vs too small sk_rcvbuf for given sk_rcvlowat constraint,
it missed to address issue caused by memory pressure.

1) If we are under memory pressure and socket receive queue is empty.
First incoming packet is allowed to be queued, after commit
76dfa60820 ("tcp: allow one skb to be received per socket under memory pressure")

But we do not send EPOLLIN yet, in case tcp_data_ready() sees sk_rcvlowat
is bigger than skb length.

2) Then, when next packet comes, it is dropped, and we directly
call sk->sk_data_ready().

3) If application is using poll(), tcp_poll() will then use
tcp_stream_is_readable() and decide the socket receive queue is
not yet filled, so nothing will happen.

Even when sender retransmits packets, phases 2) & 3) repeat
and flow is effectively frozen, until memory pressure is off.

Fix is to consider tcp_under_memory_pressure() to take care
of global memory pressure or memcg pressure.

Fixes: 24adbc1676af ("tcp: fix SO_RCVLOWAT hangs with fat skbs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Arjun Roy <arjunroy@google.com>
Suggested-by: Wei Wang <weiwan@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 09:39:36 +01:00
Pengcheng Yang
dc7bc439a8 tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN
commit 62d9f1a6945ba69c125e548e72a36d203b30596e upstream.

Upon receiving a cumulative ACK that changes the congestion state from
Disorder to Open, the TLP timer is not set. If the sender is app-limited,
it can only wait for the RTO timer to expire and retransmit.

The reason for this is that the TLP timer is set before the congestion
state changes in tcp_ack(), so we delay the time point of calling
tcp_set_xmit_timer() until after tcp_fastretrans_alert() returns and
remove the FLAG_SET_XMIT_TIMER from ack_flag when the RACK reorder timer
is set.

This commit has two additional benefits:
1) Make sure to reset RTO according to RFC6298 when receiving ACK, to
avoid spurious RTO caused by RTO timer early expires.
2) Reduce the xmit timer reschedule once per ACK when the RACK reorder
timer is set.

Fixes: df92c8394e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Link: https://lore.kernel.org/netdev/1611311242-6675-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1611464834-23030-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03 23:23:27 +01:00
Eric Dumazet
99779c24bf tcp: fix SO_RCVLOWAT hangs with fat skbs
[ Upstream commit 24adbc1676af4e134e709ddc7f34cf2adc2131e4 ]

We autotune rcvbuf whenever SO_RCVLOWAT is set to account for 100%
overhead in tcp_set_rcvlowat()

This works well when skb->len/skb->truesize ratio is bigger than 0.5

But if we receive packets with small MSS, we can end up in a situation
where not enough bytes are available in the receive queue to satisfy
RCVLOWAT setting.
As our sk_rcvbuf limit is hit, we send zero windows in ACK packets,
preventing remote peer from sending more data.

Even autotuning does not help, because it only triggers at the time
user process drains the queue. If no EPOLLIN is generated, this
can not happen.

Note poll() has a similar issue, after commit
c7004482e8 ("tcp: Respect SO_RCVLOWAT in tcp_poll().")

Fixes: 03f45c883c ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-20 08:18:38 +02:00
Eric Dumazet
3405bf51f6 tcp: cache line align MAX_TCP_HEADER
[ Upstream commit 9bacd256f1354883d3c1402655153367982bba49 ]

TCP stack is dumb in how it cooks its output packets.

Depending on MAX_HEADER value, we might chose a bad ending point
for the headers.

If we align the end of TCP headers to cache line boundary, we
make sure to always use the smallest number of cache lines,
which always help.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-29 16:31:21 +02:00
Eric Dumazet
a92c895e22 tcp: annotate lockless access to tcp_memory_pressure
[ Upstream commit 1f142c17d19a5618d5a633195a46f2c8be9bf232 ]

tcp_memory_pressure is read without holding any lock,
and its value could be changed on other cpus.

Use READ_ONCE() to annotate these lockless reads.

The write side is already using atomic ops.

Fixes: b8da51ebb1 ("tcp: introduce tcp_under_memory_pressure()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27 14:51:18 +01:00
Guillaume Nault
fbcf85b047 tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE()
[ Upstream commit 721c8dafad26ccfa90ff659ee19755e3377b829d ]

Syncookies borrow the ->rx_opt.ts_recent_stamp field to store the
timestamp of the last synflood. Protect them with READ_ONCE() and
WRITE_ONCE() since reads and writes aren't serialised.

Use of .rx_opt.ts_recent_stamp for storing the synflood timestamp was
introduced by a0f82f64e2 ("syncookies: remove last_synq_overflow from
struct tcp_sock"). But unprotected accesses were already there when
timestamp was stored in .last_synq_overflow.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21 10:57:19 +01:00
Guillaume Nault
4b8a98697a tcp: tighten acceptance of ACKs not matching a child socket
[ Upstream commit cb44a08f8647fd2e8db5cc9ac27cd8355fa392d8 ]

When no synflood occurs, the synflood timestamp isn't updated.
Therefore it can be so old that time_after32() can consider it to be
in the future.

That's a problem for tcp_synq_no_recent_overflow() as it may report
that a recent overflow occurred while, in fact, it's just that jiffies
has grown past 'last_overflow' + TCP_SYNCOOKIE_VALID + 2^31.

Spurious detection of recent overflows lead to extra syncookie
verification in cookie_v[46]_check(). At that point, the verification
should fail and the packet dropped. But we should have dropped the
packet earlier as we didn't even send a syncookie.

Let's refine tcp_synq_no_recent_overflow() to report a recent overflow
only if jiffies is within the
[last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval. This
way, no spurious recent overflow is reported when jiffies wraps and
'last_overflow' becomes in the future from the point of view of
time_after32().

However, if jiffies wraps and enters the
[last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval (with
'last_overflow' being a stale synflood timestamp), then
tcp_synq_no_recent_overflow() still erroneously reports an
overflow. In such cases, we have to rely on syncookie verification
to drop the packet. We unfortunately have no way to differentiate
between a fresh and a stale syncookie timestamp.

In practice, using last_overflow as lower bound is problematic.
If the synflood timestamp is concurrently updated between the time
we read jiffies and the moment we store the timestamp in
'last_overflow', then 'now' becomes smaller than 'last_overflow' and
tcp_synq_no_recent_overflow() returns true, potentially dropping a
valid syncookie.

Reading jiffies after loading the timestamp could fix the problem,
but that'd require a memory barrier. Let's just accommodate for
potential timestamp growth instead and extend the interval using
'last_overflow - HZ' as lower bound.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21 10:57:18 +01:00
Guillaume Nault
bac9e8f345 tcp: fix rejected syncookies due to stale timestamps
[ Upstream commit 04d26e7b159a396372646a480f4caa166d1b6720 ]

If no synflood happens for a long enough period of time, then the
synflood timestamp isn't refreshed and jiffies can advance so much
that time_after32() can't accurately compare them any more.

Therefore, we can end up in a situation where time_after32(now,
last_overflow + HZ) returns false, just because these two values are
too far apart. In that case, the synflood timestamp isn't updated as
it should be, which can trick tcp_synq_no_recent_overflow() into
rejecting valid syncookies.

For example, let's consider the following scenario on a system
with HZ=1000:

  * The synflood timestamp is 0, either because that's the timestamp
    of the last synflood or, more commonly, because we're working with
    a freshly created socket.

  * We receive a new SYN, which triggers synflood protection. Let's say
    that this happens when jiffies == 2147484649 (that is,
    'synflood timestamp' + HZ + 2^31 + 1).

  * Then tcp_synq_overflow() doesn't update the synflood timestamp,
    because time_after32(2147484649, 1000) returns false.
    With:
      - 2147484649: the value of jiffies, aka. 'now'.
      - 1000: the value of 'last_overflow' + HZ.

  * A bit later, we receive the ACK completing the 3WHS. But
    cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow()
    says that we're not under synflood. That's because
    time_after32(2147484649, 120000) returns false.
    With:
      - 2147484649: the value of jiffies, aka. 'now'.
      - 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID.

    Of course, in reality jiffies would have increased a bit, but this
    condition will last for the next 119 seconds, which is far enough
    to accommodate for jiffie's growth.

Fix this by updating the overflow timestamp whenever jiffies isn't
within the [last_overflow, last_overflow + HZ] range. That shouldn't
have any performance impact since the update still happens at most once
per second.

Now we're guaranteed to have fresh timestamps while under synflood, so
tcp_synq_no_recent_overflow() can safely use it with time_after32() in
such situations.

Stale timestamps can still make tcp_synq_no_recent_overflow() return
the wrong verdict when not under synflood. This will be handled in the
next patch.

For 64 bits architectures, the problem was introduced with the
conversion of ->tw_ts_recent_stamp to 32 bits integer by commit
cca9bab1b7 ("tcp: use monotonic timestamps for PAWS").
The problem has always been there on 32 bits architectures.

Fixes: cca9bab1b7 ("tcp: use monotonic timestamps for PAWS")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21 10:57:17 +01:00
Eric Dumazet
3ad3a9a242 tcp: make tcp_space() aware of socket backlog
[ Upstream commit 85bdf7db5b53cdcc7a901db12bcb3d0063e3866d ]

Jean-Louis Dupond reported poor iscsi TCP receive performance
that we tracked to backlog drops.

Apparently we fail to send window updates reflecting the
fact that we are under stress.

Note that we might lack a proper window increase when
backlog is fully processed, since __release_sock() clears
sk->sk_backlog.len _after_ all skbs have been processed.

This should not matter in practice. If we had a significant
load through socket backlog, we are in a dangerous
situation.

Reported-by: Jean-Louis Dupond <jean-louis@dupond.be>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Tested-by: Jean-Louis Dupond<jean-louis@dupond.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-13 08:52:18 +01:00
Eric Dumazet
c60f57dfe9 tcp: fix tcp_set_congestion_control() use from bpf hook
[ Upstream commit 8d650cdedaabb33e85e9b7c517c0c71fcecc1de9 ]

Neal reported incorrect use of ns_capable() from bpf hook.

bpf_setsockopt(...TCP_CONGESTION...)
  -> tcp_set_congestion_control()
   -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)
    -> ns_capable_common()
     -> current_cred()
      -> rcu_dereference_protected(current->cred, 1)

Accessing 'current' in bpf context makes no sense, since packets
are processed from softirq context.

As Neal stated : The capability check in tcp_set_congestion_control()
was written assuming a system call context, and then was reused from
a BPF call site.

The fix is to add a new parameter to tcp_set_congestion_control(),
so that the ns_capable() call is only performed under the right
context.

Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28 08:29:26 +02:00
Eric Dumazet
6323c238bb tcp: be more careful in tcp_fragment()
[ Upstream commit b617158dc096709d8600c53b6052144d12b89fab ]

Some applications set tiny SO_SNDBUF values and expect
TCP to just work. Recent patches to address CVE-2019-11478
broke them in case of losses, since retransmits might
be prevented.

We should allow these flows to make progress.

This patch allows the first and last skb in retransmit queue
to be split even if memory limits are hit.

It also adds the some room due to the fact that tcp_sendmsg()
and tcp_sendpage() might overshoot sk_wmem_queued by about one full
TSO skb (64KB size). Note this allowance was already present
in stable backports for kernels < 4.15

Note for < 4.15 backports :
 tcp_rtx_queue_tail() will probably look like :

static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
{
	struct sk_buff *skb = tcp_send_head(sk);

	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
}

Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Tested-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Christoph Paasch <cpaasch@apple.com>
Cc: Jonathan Looney <jtl@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28 08:29:25 +02:00
Eric Dumazet
c09be31461 tcp: limit payload size of sacked skbs
commit 3b4929f65b0d8249f19a50245cd88ed1a2f78cff upstream.

Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :

	BUG_ON(tcp_skb_pcount(skb) < pcount);

This can happen if the remote peer has advertized the smallest
MSS that linux TCP accepts : 48

An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.

This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
can overflow.

Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.

CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs

Fixes: 832d11c5cd ("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17 19:51:56 +02:00
Daniel Borkmann
037b0b86ec tcp, ulp: add alias for all ulp modules
Lets not turn the TCP ULP lookup into an arbitrary module loader as
we only intend to load ULP modules through this mechanism, not other
unrelated kernel modules:

  [root@bar]# cat foo.c
  #include <sys/types.h>
  #include <sys/socket.h>
  #include <linux/tcp.h>
  #include <linux/in.h>

  int main(void)
  {
      int sock = socket(PF_INET, SOCK_STREAM, 0);
      setsockopt(sock, IPPROTO_TCP, TCP_ULP, "sctp", sizeof("sctp"));
      return 0;
  }

  [root@bar]# gcc foo.c -O2 -Wall
  [root@bar]# lsmod | grep sctp
  [root@bar]# ./a.out
  [root@bar]# lsmod | grep sctp
  sctp                 1077248  4
  libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
  [root@bar]#

Fix it by adding module alias to TCP ULP modules, so probing module
via request_module() will be limited to tcp-ulp-[name]. The existing
modules like kTLS will load fine given tcp-ulp-tls alias, but others
will fail to load:

  [root@bar]# lsmod | grep sctp
  [root@bar]# ./a.out
  [root@bar]# lsmod | grep sctp
  [root@bar]#

Sockmap is not affected from this since it's either built-in or not.

Fixes: 734942cc4e ("tcp: ULP infrastructure")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-16 14:58:07 -07:00
Martin KaFai Lau
40a1227ea8 tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket
Although the actual cookie check "__cookie_v[46]_check()" does
not involve sk specific info, it checks whether the sk has recent
synq overflow event in "tcp_synq_no_recent_overflow()".  The
tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
when it has sent out a syncookie (through "tcp_synq_overflow()").

The above per sk "recent synq overflow event timestamp" works well
for non SO_REUSEPORT use case.  However, it may cause random
connection request reject/discard when SO_REUSEPORT is used with
syncookie because it fails the "tcp_synq_no_recent_overflow()"
test.

When SO_REUSEPORT is used, it usually has multiple listening
socks serving TCP connection requests destinated to the same local IP:PORT.
There are cases that the TCP-ACK-COOKIE may not be received
by the same sk that sent out the syncookie.  For example,
if reuse->socks[] began with {sk0, sk1},
1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
   was updated.
2) the reuse->socks[] became {sk1, sk2} later.  e.g. sk0 was first closed
   and then sk2 was added.  Here, sk2 does not have ts_recent_stamp set.
   There are other ordering that will trigger the similar situation
   below but the idea is the same.
3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
   "tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
   all syncookies sent by sk1 will be handled (and rejected)
   by sk2 while sk1 is still alive.

The userspace may create and remove listening SO_REUSEPORT sockets
as it sees fit.  E.g. Adding new thread (and SO_REUSEPORT sock) to handle
incoming requests, old process stopping and new process starting...etc.
With or without SO_ATTACH_REUSEPORT_[CB]BPF,
the sockets leaving and joining a reuseport group makes picking
the same sk to check the syncookie very difficult (if not impossible).

The later patches will allow bpf prog more flexibility in deciding
where a sk should be located in a bpf map and selecting a particular
SO_REUSEPORT sock as it sees fit.  e.g. Without closing any sock,
replace the whole bpf reuseport_array in one map_update() by using
map-in-map.  Getting the syncookie check working smoothly across
socks in the same "reuse->socks[]" is important.

A partial solution is to set the newly added sk's ts_recent_stamp
to the max ts_recent_stamp of a reuseport group but that will require
to iterate through reuse->socks[]  OR
pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
joining a reuseport group.  However, neither of them will solve the
existing sk getting moved around the reuse->socks[] and that
sk may not have ts_recent_stamp updated, unlikely under continuous
synflood but not impossible.

This patch opts to treat the reuseport group as a whole when
considering the last synq overflow timestamp since
they are serving the same IP:PORT from the userspace
(and BPF program) perspective.

"synq_overflow_ts" is added to "struct sock_reuseport".
The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
will update/check reuse->synq_overflow_ts if the sk is
in a reuseport group.  Similar to the reuseport decision in
__inet_lookup_listener(), both sk->sk_reuseport and
sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
Update on "synq_overflow_ts" happens at roughly once
every second.

A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
No meaningful performance change is observed.  Before and
after the change is ~9Mpps in IPv4.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:45 +02:00
David S. Miller
19725496da Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-07-24 19:21:58 -07:00
David Ahern
24b711edfc net/ipv6: Fix linklocal to global address with VRF
Example setup:
    host: ip -6 addr add dev eth1 2001:db8:104::4
           where eth1 is enslaved to a VRF

    switch: ip -6 ro add 2001:db8:104::4/128 dev br1
            where br1 only has an LLA

           ping6 2001:db8:104::4
           ssh   2001:db8:104::4

(NOTE: UDP works fine if the PKTINFO has the address set to the global
address and ifindex is set to the index of eth1 with a destination an
LLA).

For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
L3 master. If it is then return the ifindex from rt6i_idev similar
to what is done for loopback.

For TCP, restore the original tcp_v6_iif definition which is needed in
most places and add a new tcp_v6_iif_l3_slave that considers the
l3_slave variability. This latter check is only needed for socket
lookups.

Fixes: 9ff7438460 ("net: vrf: Handle ipv6 multicast and link-local addresses")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-21 19:31:46 -07:00
David S. Miller
c4c5551df1 Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux
All conflicts were trivial overlapping changes, so reasonably
easy to resolve.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 21:17:12 -07:00
Yuchung Cheng
a0496ef2c2 tcp: do not delay ACK in DCTCP upon CE status change
Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
has to be sent immediately so the sender can respond quickly:

""" When receiving packets, the CE codepoint MUST be processed as follows:

   1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
       true and send an immediate ACK.

   2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
       to false and send an immediate ACK.
"""

Previously DCTCP implementation may continue to delay the ACK. This
patch fixes that to implement the RFC by forcing an immediate ACK.

Tested with this packetdrill script provided by Larry Brakmo

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
   +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0

0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001

0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001

0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
+0.005 < [ce] . 2001:3001(1000) ack 2 win 257

+0.000 > [ect01] . 2:2(0) ack 2001
// Previously the ACK below would be delayed by 40ms
+0.000 > [ect01] E. 2:2(0) ack 3001

+0.500 < F. 9501:9501(0) ack 4 win 257

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 14:32:23 -07:00
Yuchung Cheng
27cde44a25 tcp: do not cancel delay-AcK on DCTCP special ACK
Currently when a DCTCP receiver delays an ACK and receive a
data packet with a different CE mark from the previous one's, it
sends two immediate ACKs acking previous and latest sequences
respectly (for ECN accounting).

Previously sending the first ACK may mark off the delayed ACK timer
(tcp_event_ack_sent). This may subsequently prevent sending the
second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
The culprit is that tcp_send_ack() assumes it always acknowleges
the latest sequence, which is not true for the first special ACK.

The fix is to not make the assumption in tcp_send_ack and check the
actual ack sequence before cancelling the delayed ACK. Further it's
safer to pass the ack sequence number as a local variable into
tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
future bugs like this.

Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 14:32:23 -07:00
Yuchung Cheng
a69258f7aa tcp: remove DELAYED ACK events in DCTCP
After fixing the way DCTCP tracking delayed ACKs, the delayed-ACK
related callbacks are no longer needed

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-13 18:30:19 -07:00
Arnd Bergmann
cca9bab1b7 tcp: use monotonic timestamps for PAWS
Using get_seconds() for timestamps is deprecated since it can lead
to overflows on 32-bit systems. While the interface generally doesn't
overflow until year 2106, the specific implementation of the TCP PAWS
algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
overflow.

A related problem is that the local timestamps in CLOCK_REALTIME form
lead to unexpected behavior when settimeofday is called to set the system
clock backwards or forwards by more than 24 days.

While the first problem could be solved by using an overflow-safe method
of comparing the timestamps, a nicer solution is to use a monotonic
clocksource with ktime_get_seconds() that simply doesn't overflow (at
least not until 136 years after boot) and that doesn't change during
settimeofday().

To make 32-bit and 64-bit architectures behave the same way here, and
also save a few bytes in the tcp_options_received structure, I'm changing
the type to a 32-bit integer, which is now safe on all architectures.

Finally, the ts_recent_stamp field also (confusingly) gets used to store
a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow().
This is currently safe, but changing the type to 32-bit requires
some small changes there to keep it working.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-12 14:50:40 -07:00
Deepti Raghavan
4929c9428a tcp: expose both send and receive intervals for rate sample
Congestion control algorithms, which access the rate sample
through the tcp_cong_control function, only have access to the maximum
of the send and receive interval, for cases where the acknowledgment
rate may be inaccurate due to ACK compression or decimation. Algorithms
may want to use send rates and receive rates as separate signals.

Signed-off-by: Deepti Raghavan <deeptir@mit.edu>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-11 23:01:56 -07:00
John Fastabend
0ea488ff8d bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb
In commit

  'bpf: bpf_compute_data uses incorrect cb structure' (8108a77515)

we added the routine bpf_compute_data_end_sk_skb() to compute the
correct data_end values, but this has since been lost. In kernel
v4.14 this was correct and the above patch was applied in it
entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
we lost the piece that renamed bpf_compute_data_pointers to the
new function bpf_compute_data_end_sk_skb. This was done here,

e1ea2f9856 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")

When it conflicted with the following rename patch,

6aaae2b6c4 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")

Finally, after a refactor I thought even the function
bpf_compute_data_end_sk_skb() was no longer needed and it was
erroneously removed.

However, we never reverted the sk_skb_convert_ctx_access() usage of
tcp_skb_cb which had been committed and survived the merge conflict.
Here we fix this by adding back the helper and *_data_end_sk_skb()
usage. Using the bpf_skc_data_end mapping is not correct because it
expects a qdisc_skb_cb object but at the sock layer this is not the
case. Even though it happens to work here because we don't overwrite
any data in-use at the socket layer and the cb structure is cleared
later this has potential to create some subtle issues. But, even
more concretely the filter.c access check uses tcp_skb_cb.

And by some act of chance though,

struct bpf_skb_data_end {
        struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */

        /* XXX 4 bytes hole, try to pack */

        void *                     data_meta;            /*    32     8 */
        void *                     data_end;             /*    40     8 */

        /* size: 48, cachelines: 1, members: 3 */
        /* sum members: 44, holes: 1, sum holes: 4 */
        /* last cacheline: 48 bytes */
};

and then tcp_skb_cb,

struct tcp_skb_cb {
	[...]
                struct {
                        __u32      flags;                /*    24     4 */
                        struct sock * sk_redir;          /*    32     8 */
                        void *     data_end;             /*    40     8 */
                } bpf;                                   /*          24 */
        };

So when we use offset_of() to track down the byte offset we get 40 in
either case and everything continues to work. Fix this mess and use
correct structures its unclear how long this might actually work for
until someone moves the structs around.

Reported-by: Martin KaFai Lau <kafai@fb.com>
Fixes: e1ea2f9856 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Fixes: 6aaae2b6c4 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-07-07 15:19:30 -07:00
David S. Miller
5cd3da4ba2 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
Simple overlapping changes in stmmac driver.

Adjust skb_gro_flush_final_remcsum function signature to make GRO list
changes in net-next, as per Stephen Rothwell's example merge
resolution.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-03 10:29:26 +09:00
Linus Torvalds
a11e1d432b Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL
The poll() changes were not well thought out, and completely
unexplained.  They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead.  That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case.  The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.

[ This revert is a revert of about 30 different commits, not reverted
  individually because that would just be unnecessarily messy  - Linus ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28 10:40:47 -07:00
David Miller
d4546c2509 net: Convert GRO SKB handling to list_head.
Manage pending per-NAPI GRO packets via list_head.

Return an SKB pointer from the GRO receive handlers.  When GRO receive
handlers return non-NULL, it means that this SKB needs to be completed
at this time and removed from the NAPI queue.

Several operations are greatly simplified by this transformation,
especially timing out the oldest SKB in the list when gro_count
exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue
in reverse order.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-26 11:33:04 +09:00