Changes in 4.19.285
cdc_ncm: Implement the 32-bit version of NCM Transfer Block
net: cdc_ncm: Deal with too low values of dwNtbOutMaxSize
power: supply: bq27xxx: After charger plug in/out wait 0.5s for things to stabilize
power: supply: core: Refactor power_supply_set_input_current_limit_from_supplier()
power: supply: bq24190: Call power_supply_changed() after updating input current
cdc_ncm: Fix the build warning
bluetooth: Add cmd validity checks at the start of hci_sock_ioctl()
ipv{4,6}/raw: fix output xfrm lookup wrt protocol
netfilter: ctnetlink: Support offloaded conntrack entry deletion
dmaengine: pl330: rename _start to prevent build error
net/mlx5: fw_tracer, Fix event handling
netrom: fix info-leak in nr_write_internal()
af_packet: Fix data-races of pkt_sk(sk)->num.
amd-xgbe: fix the false linkup in xgbe_phy_status
af_packet: do not use READ_ONCE() in packet_bind()
tcp: deny tcp_disconnect() when threads are waiting
tcp: Return user_mss for TCP_MAXSEG in CLOSE/LISTEN state if user_mss set
net/sched: sch_ingress: Only create under TC_H_INGRESS
net/sched: sch_clsact: Only create under TC_H_CLSACT
net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs
net/sched: Prohibit regrafting ingress or clsact Qdiscs
net: sched: fix NULL pointer dereference in mq_attach
ocfs2/dlm: move BITS_TO_BYTES() to bitops.h for wider use
net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report
udp6: Fix race condition in udp6_sendmsg & connect
net/sched: flower: fix possible OOB write in fl_set_geneve_opt()
net: dsa: mv88e6xxx: Increase wait after reset deactivation
watchdog: menz069_wdt: fix watchdog initialisation
mailbox: mailbox-test: Fix potential double-free in mbox_test_message_write()
ARM: 9295/1: unwind:fix unwind abort for uleb128 case
media: rcar-vin: Select correct interrupt mode for V4L2_FIELD_ALTERNATE
fbdev: modedb: Add 1920x1080 at 60 Hz video mode
fbdev: stifb: Fix info entry in sti_struct on error path
nbd: Fix debugfs_create_dir error checking
ASoC: dwc: limit the number of overrun messages
xfrm: Check if_id in inbound policy/secpath match
ASoC: ssm2602: Add workaround for playback distortions
media: dvb_demux: fix a bug for the continuity counter
media: dvb-usb: az6027: fix three null-ptr-deref in az6027_i2c_xfer()
media: dvb-usb-v2: ec168: fix null-ptr-deref in ec168_i2c_xfer()
media: dvb-usb-v2: ce6230: fix null-ptr-deref in ce6230_i2c_master_xfer()
media: dvb-usb-v2: rtl28xxu: fix null-ptr-deref in rtl28xxu_i2c_xfer
media: dvb-usb: digitv: fix null-ptr-deref in digitv_i2c_xfer()
media: dvb-usb: dw2102: fix uninit-value in su3000_read_mac_address
media: netup_unidvb: fix irq init by register it at the end of probe
media: dvb_ca_en50221: fix a size write bug
media: ttusb-dec: fix memory leak in ttusb_dec_exit_dvb()
media: mn88443x: fix !CONFIG_OF error by drop of_match_ptr from ID table
media: dvb-core: Fix use-after-free due on race condition at dvb_net
media: dvb-core: Fix kernel WARNING for blocking operation in wait_event*()
media: dvb-core: Fix use-after-free due to race condition at dvb_ca_en50221
wifi: rtl8xxxu: fix authentication timeout due to incorrect RCR value
ARM: dts: stm32: add pin map for CAN controller on stm32f7
arm64/mm: mark private VM_FAULT_X defines as vm_fault_t
scsi: core: Decrease scsi_device's iorequest_cnt if dispatch failed
wifi: b43: fix incorrect __packed annotation
netfilter: conntrack: define variables exp_nat_nla_policy and any_addr with CONFIG_NF_NAT
ALSA: oss: avoid missing-prototype warnings
atm: hide unused procfs functions
mailbox: mailbox-test: fix a locking issue in mbox_test_message_write()
iio: adc: mxs-lradc: fix the order of two cleanup operations
HID: google: add jewel USB id
HID: wacom: avoid integer overflow in wacom_intuos_inout()
iio: dac: mcp4725: Fix i2c_master_send() return value handling
iio: dac: build ad5758 driver when AD5758 is selected
net: usb: qmi_wwan: Set DTR quirk for BroadMobi BM818
usb: gadget: f_fs: Add unbind event before functionfs_unbind
scsi: stex: Fix gcc 13 warnings
ata: libata-scsi: Use correct device no in ata_find_dev()
x86/boot: Wrap literal addresses in absolute_pointer()
ACPI: thermal: drop an always true check
gcc-12: disable '-Wdangling-pointer' warning for now
eth: sun: cassini: remove dead code
kernel/extable.c: use address-of operator on section symbols
lib/dynamic_debug.c: use address-of operator on section symbols
wifi: rtlwifi: remove always-true condition pointed out by GCC 12
hwmon: (scmi) Remove redundant pointer check
regulator: da905{2,5}: Remove unnecessary array check
rsi: Remove unnecessary boolean condition
mmc: vub300: fix invalid response handling
tty: serial: fsl_lpuart: use UARTCTRL_TXINV to send break instead of UARTCTRL_SBK
selinux: don't use make's grouped targets feature yet
ext4: add lockdep annotations for i_data_sem for ea_inode's
fbcon: Fix null-ptr-deref in soft_cursor
regmap: Account for register length when chunking
scsi: dpt_i2o: Remove broken pass-through ioctl (I2OUSERCMD)
scsi: dpt_i2o: Do not process completions with invalid addresses
wifi: rtlwifi: 8192de: correct checking of IQK reload
Linux 4.19.285
Change-Id: Iaf7feb2883577ce4296e9b14d3e6d5f88edf4005
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 3632679d9e4f879f49949bb5b050e0de553e4739 upstream.
With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the
protocol field of the flow structure, build by raw_sendmsg() /
rawv6_sendmsg()), is set to IPPROTO_RAW. This breaks the ipsec policy
lookup when some policies are defined with a protocol in the selector.
For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to
specify the protocol. Just accept all values for IPPROTO_RAW socket.
For ipv4, the sin_port field of 'struct sockaddr_in' could not be used
without breaking backward compatibility (the value of this field was never
checked). Let's add a new kind of control message, so that the userland
could specify which protocol is used.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
CC: stable@vger.kernel.org
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.19.254
riscv: add as-options for modules with assembly compontents
xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE
xfrm: xfrm_policy: fix a possible double xfrm_pols_put() in xfrm_bundle_lookup()
power/reset: arm-versatile: Fix refcount leak in versatile_reboot_probe
pinctrl: ralink: Check for null return of devm_kcalloc
perf/core: Fix data race between perf_event_set_output() and perf_mmap_close()
ip: Fix data-races around sysctl_ip_fwd_use_pmtu.
ip: Fix data-races around sysctl_ip_nonlocal_bind.
ip: Fix a data-race around sysctl_fwmark_reflect.
tcp/dccp: Fix a data-race around sysctl_tcp_fwmark_accept.
tcp: Fix data-races around sysctl_tcp_mtu_probing.
tcp: Fix a data-race around sysctl_tcp_probe_threshold.
tcp: Fix a data-race around sysctl_tcp_probe_interval.
i2c: cadence: Change large transfer count reset logic to be unconditional
net: stmmac: fix dma queue left shift overflow issue
net/tls: Fix race in TLS device down flow
igmp: Fix data-races around sysctl_igmp_llm_reports.
igmp: Fix a data-race around sysctl_igmp_max_memberships.
tcp: Fix data-races around sysctl_tcp_reordering.
tcp: Fix data-races around some timeout sysctl knobs.
tcp: Fix a data-race around sysctl_tcp_notsent_lowat.
tcp: Fix a data-race around sysctl_tcp_tw_reuse.
tcp: Fix data-races around sysctl_tcp_fastopen.
be2net: Fix buffer overflow in be_get_module_eeprom
tcp: Fix a data-race around sysctl_tcp_early_retrans.
tcp: Fix data-races around sysctl_tcp_recovery.
tcp: Fix a data-race around sysctl_tcp_thin_linear_timeouts.
tcp: Fix data-races around sysctl_tcp_slow_start_after_idle.
tcp: Fix a data-race around sysctl_tcp_retrans_collapse.
tcp: Fix a data-race around sysctl_tcp_stdurg.
tcp: Fix a data-race around sysctl_tcp_rfc1337.
tcp: Fix data-races around sysctl_tcp_max_reordering.
Revert "Revert "char/random: silence a lockdep splat with printk()""
mm/mempolicy: fix uninit-value in mpol_rebind_policy()
bpf: Make sure mac_header was set before using it
drm/tilcdc: Remove obsolete crtc_mode_valid() hack
tilcdc: tilcdc_external: fix an incorrect NULL check on list iterator
HID: multitouch: simplify the application retrieval
HID: multitouch: Lenovo X1 Tablet Gen3 trackpoint and buttons
HID: multitouch: add support for the Smart Tech panel
HID: add ALWAYS_POLL quirk to lenovo pixart mouse
dlm: fix pending remove if msg allocation fails
ima: remove the IMA_TEMPLATE Kconfig option
ALSA: memalloc: Align buffer allocations in page size
Bluetooth: Add bt_skb_sendmsg helper
Bluetooth: Add bt_skb_sendmmsg helper
Bluetooth: SCO: Replace use of memcpy_from_msg with bt_skb_sendmsg
Bluetooth: RFCOMM: Replace use of memcpy_from_msg with bt_skb_sendmmsg
Bluetooth: Fix passing NULL to PTR_ERR
Bluetooth: SCO: Fix sco_send_frame returning skb->len
Bluetooth: Fix bt_skb_sendmmsg not allocating partial chunks
serial: mvebu-uart: correctly report configured baudrate value
tty: drivers/tty/, stop using tty_schedule_flip()
tty: the rest, stop using tty_schedule_flip()
tty: drop tty_schedule_flip()
tty: extract tty_flip_buffer_commit() from tty_flip_buffer_push()
tty: use new tty_insert_flip_string_and_push_buffer() in pty_write()
net: usb: ax88179_178a needs FLAG_SEND_ZLP
PCI: hv: Fix multi-MSI to allow more than one MSI vector
PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI
PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()
PCI: hv: Fix interrupt mapping for multi-MSI
Linux 4.19.254
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I8164bc3c6ca4775c4fffb7983f7d5b3b11f5bb09
[ Upstream commit 85d0b4dbd74b95cc492b1f4e34497d3f894f5d9a ]
While reading sysctl_fwmark_reflect, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: e110861f86 ("net: add a sysctl to reflect the fwmark on replies")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 60c158dc7b1f0558f6cadd5b50d0386da0000d50 ]
While reading sysctl_ip_fwd_use_pmtu, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: f87c10a8aa ("ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 4.19.228
Bluetooth: refactor malicious adv data check
s390/hypfs: include z/VM guests with access control group set
scsi: zfcp: Fix failed recovery on gone remote port with non-NPIV FCP devices
udf: Restore i_lenAlloc when inode expansion fails
udf: Fix NULL ptr deref when converting from inline format
PM: wakeup: simplify the output logic of pm_show_wakelocks()
drm/etnaviv: relax submit size limits
netfilter: nft_payload: do not update layer 4 checksum when mangling fragments
serial: 8250: of: Fix mapped region size when using reg-offset property
serial: stm32: fix software flow control transfer
tty: n_gsm: fix SW flow control encoding/handling
tty: Add support for Brainboxes UC cards.
usb-storage: Add unusual-devs entry for VL817 USB-SATA bridge
usb: common: ulpi: Fix crash in ulpi_match()
usb: gadget: f_sourcesink: Fix isoc transfer for USB_SPEED_SUPER_PLUS
USB: core: Fix hang in usb_kill_urb by adding memory barriers
usb: typec: tcpm: Do not disconnect while receiving VBUS off
net: sfp: ignore disabled SFP node
powerpc/32: Fix boot failure with GCC latent entropy plugin
i40e: Increase delay to 1 s after global EMP reset
i40e: Fix issue when maximum queues is exceeded
i40e: Fix queues reservation for XDP
i40e: fix unsigned stat widths
rpmsg: char: Fix race between the release of rpmsg_ctrldev and cdev
rpmsg: char: Fix race between the release of rpmsg_eptdev and cdev
scsi: bnx2fc: Flush destroy_work queue before calling bnx2fc_interface_put()
ipv6_tunnel: Rate limit warning messages
net: fix information leakage in /proc/net/ptype
ping: fix the sk_bound_dev_if match in ping_lookup
ipv4: avoid using shared IP generator for connected sockets
hwmon: (lm90) Reduce maximum conversion rate for G781
NFSv4: Handle case where the lookup of a directory fails
NFSv4: nfs_atomic_open() can race when looking up a non-regular file
net-procfs: show net devices bound packet types
drm/msm: Fix wrong size calculation
drm/msm/dsi: invalid parameter check in msm_dsi_phy_enable
ipv6: annotate accesses to fn->fn_sernum
NFS: Ensure the server has an up to date ctime before hardlinking
NFS: Ensure the server has an up to date ctime before renaming
phylib: fix potential use-after-free
ibmvnic: init ->running_cap_crqs early
ibmvnic: don't spin in tasklet
yam: fix a memory leak in yam_siocdevprivate()
ipv4: raw: lock the socket in raw_bind()
ipv4: tcp: send zero IPID in SYNACK messages
netfilter: nat: remove l4 protocol port rovers
netfilter: nat: limit port clash resolution attempts
tcp: fix possible socket leaks in internal pacing mode
ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback
net: amd-xgbe: ensure to reset the tx_timer_active flag
net: amd-xgbe: Fix skb data length underflow
rtnetlink: make sure to refresh master_dev/m_ops in __rtnl_newlink()
af_packet: fix data-race in packet_setsockopt / packet_setsockopt
audit: improve audit queue handling when "audit=1" on cmdline
ASoC: ops: Reject out of bounds values in snd_soc_put_volsw()
ASoC: ops: Reject out of bounds values in snd_soc_put_volsw_sx()
ASoC: ops: Reject out of bounds values in snd_soc_put_xr_sx()
ALSA: hda/realtek: Add missing fixup-model entry for Gigabyte X570 ALC1220 quirks
ALSA: hda/realtek: Fix silent output on Gigabyte X570S Aorus Master (newer chipset)
ALSA: hda/realtek: Fix silent output on Gigabyte X570 Aorus Xtreme after reboot from Windows
drm/nouveau: fix off by one in BIOS boundary checking
block: bio-integrity: Advance seed correctly for larger interval sizes
Revert "ASoC: mediatek: Check for error clk pointer"
RDMA/mlx4: Don't continue event handler after memory allocation failure
iommu/vt-d: Fix potential memory leak in intel_setup_irq_remapping()
iommu/amd: Fix loop timeout issue in iommu_ga_log_enable()
spi: bcm-qspi: check for valid cs before applying chip select
spi: mediatek: Avoid NULL pointer crash in interrupt
spi: meson-spicc: add IRQ check in meson_spicc_probe
net: ieee802154: hwsim: Ensure proper channel selection at probe time
net: ieee802154: mcr20a: Fix lifs/sifs periods
net: ieee802154: ca8210: Stop leaking skb's
net: ieee802154: Return meaningful error codes from the netlink helpers
net: macsec: Verify that send_sci is on when setting Tx sci explicitly
net: stmmac: ensure PTP time register reads are consistent
drm/i915/overlay: Prevent divide by zero bugs in scaling
ASoC: fsl: Add missing error handling in pcm030_fabric_probe
ASoC: cpcap: Check for NULL pointer after calling of_get_child_by_name
ASoC: max9759: fix underflow in speaker_gain_control_put()
scsi: bnx2fc: Make bnx2fc_recv_frame() mp safe
nfsd: nfsd4_setclientid_confirm mistakenly expires confirmed client.
selftests: futex: Use variable MAKE instead of make
rtc: cmos: Evaluate century appropriate
EDAC/altera: Fix deferred probing
EDAC/xgene: Fix deferred probing
ext4: fix error handling in ext4_restore_inline_data()
Linux 4.19.228
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I519bfcc5c5e4bba354c76e47dad34fba237809c0
commit 23f57406b82de51809d5812afd96f210f8b627f3 upstream.
ip_select_ident_segs() has been very conservative about using
the connected socket private generator only for packets with IP_DF
set, claiming it was needed for some VJ compression implementations.
As mentioned in this referenced document, this can be abused.
(Ref: Off-Path TCP Exploits of the Mixed IPID Assignment)
Before switching to pure random IPID generation and possibly hurt
some workloads, lets use the private inet socket generator.
Not only this will remove one vulnerability, this will also
improve performance of TCP flows using pmtudisc==IP_PMTUDISC_DONT
Fixes: 73f156a6e8 ("inetpeer: get rid of ip_id_count")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reported-by: Ray Che <xijiache@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 92071a2b8f7f ("net: lwtunnel: handle MTU calculation in
forwading") was backported to 5.4.132 and it caused a crc change in
include/net/ip.h due to a new .h file being included.
No real abi change happened here, so just #ifdef the .h file out when
doing the crc check.
Bug: 161946584
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I1bad0458bc0f9130239fc94ca982e800681cd443
[ Upstream commit fade56410c22cacafb1be9f911a0afd3701d8366 ]
Commit 14972cbd34 ("net: lwtunnel: Handle fragmentation") moved
fragmentation logic away from lwtunnel by carry encap headroom and
use it in output MTU calculation. But the forwarding part was not
covered and created difference in MTU for output and forwarding and
further to silent drops on ipv4 forwarding path. Fix it by taking
into account lwtunnel encap headroom.
The same commit also introduced difference in how to treat RTAX_MTU
in IPv4 and IPv6 where latter explicitly removes lwtunnel encap
headroom from route MTU. Make IPv4 version do the same.
Fixes: 14972cbd34 ("net: lwtunnel: Handle fragmentation")
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 02a1b175b0e92d9e0fa5df3957ade8d733ceb6a0 ]
Documentation/networking/ip-sysctl.txt:46 says:
ip_forward_use_pmtu - BOOLEAN
By default we don't trust protocol path MTUs while forwarding
because they could be easily forged and can lead to unwanted
fragmentation by the router.
You only need to enable this if you have user-space software
which tries to discover path mtus by itself and depends on the
kernel honoring this information. This is normally not the case.
Default: 0 (disabled)
Possible values:
0 - disabled
1 - enabled
Which makes it pretty clear that setting it to 1 is a potential
security/safety/DoS issue, and yet it is entirely reasonable to want
forwarded traffic to honour explicitly administrator configured
route mtus (instead of defaulting to device mtu).
Indeed, I can't think of a single reason why you wouldn't want to.
Since you configured a route mtu you probably know better...
It is pretty common to have a higher device mtu to allow receiving
large (jumbo) frames, while having some routes via that interface
(potentially including the default route to the internet) specify
a lower mtu.
Note that ipv6 forwarding uses device mtu unless the route is locked
(in which case it will use the route mtu).
This approach is not usable for IPv4 where an 'mtu lock' on a route
also has the side effect of disabling TCP path mtu discovery via
disabling the IPv4 DF (don't frag) bit on all outgoing frames.
I'm not aware of a way to lock a route from an IPv6 RA, so that also
potentially seems wrong.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Sunmeet Gill (Sunny) <sgill@quicinc.com>
Cc: Vinay Paradkar <vparadka@qti.qualcomm.com>
Cc: Tyler Wear <twear@quicinc.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8c83f2df9c6578ea4c5b940d8238ad8a41b87e9e ]
Configuration check to accept source route IP options should be made on
the incoming netdevice when the skb->dev is an l3mdev master. The route
lookup for the source route next hop also needs the incoming netdev.
v2->v3:
- Simplify by passing the original netdevice down the stack (per David
Ahern).
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5e1a99eae84999a2536f50a0beaf5d5262337f40 ]
For ip rules, we need to use 'ipproto ipv6-icmp' to match ICMPv6 headers.
But for ip -6 route, currently we only support tcp, udp and icmp.
Add ICMPv6 support so we can match ipv6-icmp rules for route lookup.
v2: As David Ahern and Sabrina Dubroca suggested, Add an argument to
rtm_getroute_parse_ip_proto() to handle ICMP/ICMPv6 with different family.
Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: eacb9384a3 ("ipv6: support sport, dport and ip_proto in RTM_GETROUTE")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly
after modification by __sock_cmsg_send, by calling sock_tx_timestamp.
The IPv4 and IPv6 paths do this conversion differently. In IPv4, the
individual protocols that support tx timestamps call this function
and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is
called in __ip6_append_data.
There is no need to store both tx_flags and ts_flags in the cookie
as one is derived from the other. Convert when setting up the cork
and remove the redundant field. This is similar to IPv6, only have
the conversion happen only once per datagram, in ip(6)_setup_cork.
Also change __ip6_append_data to match __ip_append_data. Only update
tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is
redundant: only valid protocols can have non-zero cork->tx_flags.
After this change the IPv4 and IPv6 logic is the same.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize the cookie in one location to reduce code duplication and
avoid bugs from inconsistent initialization, such as that fixed in
commit 9887cba199 ("ip: limit use of gso_size to udp").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also involved adding a way to run a netfilter hook over a list of packets.
Rather than attempting to make netfilter know about lists (which would be
a major project in itself) we just let it call the regular okfn (in this
case ip_rcv_finish()) for any packets it steals, and have it give us back
a list of packets it's synchronously accepted (which normally NF_HOOK
would automatically call okfn() on, but we want to be able to potentially
pass the list to a listified version of okfn().)
The netfilter hooks themselves are indirect calls that still happen per-
packet (see nf_hook_entry_hookfn()), but again, changing that can be left
for future work.
There is potential for out-of-order receives if the netfilter hook ends up
synchronously stealing packets, as they will be processed before any
accepts earlier in the list. However, it was already possible for an
asynchronous accept to cause out-of-order receives, so presumably this is
considered OK.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces __ip_queue_xmit(), through which the callers
can pass tos param into it without having to set inet->tos. For
ipv6, ip6_xmit() already allows passing tclass parameter.
It's needed when some transport protocol doesn't use inet->tos,
like sctp's per transport dscp, which will be added in next patch.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a followup to fib rules sport, dport and ipproto
match support. Only supports tcp, udp and icmp for ipproto.
Used by fib rule self tests.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support generic segmentation offload for udp datagrams. Callers can
concatenate and send at once the payload of multiple datagrams with
the same destination.
To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.
A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.
Total byte length may then exceed MTU. If not an exact multiple of
segment size, the last segment will be shorter.
The implementation adds a gso_size field to the udp socket, ip(v6)
cmsg cookie and inet_cork structure to be able to set the value at
setsockopt or cmsg time and to work with both lockless and corked
paths.
Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.
tcp tso
3197 MB/s 54232 msg/s 54232 calls/s
6,457,754,262 cycles
tcp gso
1765 MB/s 29939 msg/s 29939 calls/s
11,203,021,806 cycles
tcp without tso/gso *
739 MB/s 12548 msg/s 12548 calls/s
11,205,483,630 cycles
udp
876 MB/s 14873 msg/s 624666 calls/s
11,205,777,429 cycles
udp gso
2139 MB/s 36282 msg/s 36282 calls/s
11,204,374,561 cycles
[*] after reverting commit 0a6b2a1dc2
("tcp: switch to GSO being always on")
Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:
perf stat -a -C 12 -e cycles \
./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4
Note the reduction in calls/s with GSO. Bytes per syscall drops
increases from 1470 to 61818.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP segmentation offload needs access to inet_cork in the udp layer.
Pass the struct to ip(6)_make_skb instead of allocating it on the
stack in that function itself.
This patch is a noop otherwise.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move logic of fib_convert_metrics into ip_metrics_convert. This allows
the code that converts netlink attributes into metrics struct to be
re-used in a later patch by IPv6.
This is mostly a code move with the following changes to variable names:
- fi->fib_net becomes net
- fc_mx and fc_mx_len are passed as inputs pulled from fib_config
- metrics array is passed as an input from fi->fib_metrics->metrics
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()
Also since we use rhashtable we can bring back the number of fragments
in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
removed in commit 434d305405 ("inet: frag: don't account number
of fragment queues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fun set of conflict resolutions here...
For the mac80211 stuff, these were fortunately just parallel
adds. Trivially resolved.
In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.
In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.
The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.
The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:
====================
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch. This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
(IB/mlx5: Fix cleanup order on unload) added to for-rc and
commit b5ca15ad7e (IB/mlx5: Add proper representors support)
add as part of the devel cycle both needed to modify the
init/de-init functions used by mlx5. To support the new
representors, the new functions added by the cleanup patch
needed to be made non-static, and the init/de-init list
added by the representors patch needed to be modified to
match the init/de-init list changes made by the cleanup
patch.
Updates:
drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
prototypes added by representors patch to reflect new function
names as changed by cleanup patch
drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
stage list to match new order from cleanup patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is optimization, which makes ip_call_ra_chain()
iterate less sockets to find the sockets it's looking for.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prior to the rework of PMTU information storage in commit
2c8cec5c10 ("ipv4: Cache learned PMTU information in inetpeer."),
when a PMTU event advertising a PMTU smaller than
net.ipv4.route.min_pmtu was received, we would disable setting the DF
flag on packets by locking the MTU metric, and set the PMTU to
net.ipv4.route.min_pmtu.
Since then, we don't disable DF, and set PMTU to
net.ipv4.route.min_pmtu, so the intermediate router that has this link
with a small MTU will have to drop the packets.
This patch reestablishes pre-2.6.39 behavior by splitting
rtable->rt_pmtu into a bitfield with rt_mtu_locked and rt_pmtu.
rt_mtu_locked indicates that we shouldn't set the DF bit on that path,
and is checked in ip_dont_fragment().
One possible workaround is to set net.ipv4.route.min_pmtu to a value low
enough to accommodate the lowest MTU encountered.
Fixes: 2c8cec5c10 ("ipv4: Cache learned PMTU information in inetpeer.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ran simple script to find/remove trailing whitespace and blank lines
at EOF because that kind of stuff git whines about and editors leave
behind.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Three sets of overlapping changes, two in the packet scheduler
and one in the meson-gxl PHY driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4 stack reacts to changes to small MTU, by disabling itself under
RTNL.
But there is a window where threads not using RTNL can see a wrong
device mtu. This can lead to surprises, in igmp code where it is
assumed the mtu is suitable.
Fix this by reading device mtu once and checking IPv4 minimal MTU.
This patch adds missing IPV4_MIN_MTU define, to not abuse
ETH_MIN_MTU anymore.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the udp[46]_portaddr_hash()
to net/ip[v6].h. The function name is renamed to
ipv[46]_portaddr_hash().
It will be used by a later patch which adds a second listener
hashtable hashed by the address and port.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While working on yet another syzkaller report, I found
that our IP_MAX_MTU enforcements were not properly done.
gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
final result can be bigger than IP_MAX_MTU :/
This is a problem because device mtu can be changed on other cpus or
threads.
While this patch does not fix the issue I am working on, it is
probably worth addressing it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a second device index, sdif, to udp socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.
Early demux lookups are handled in the next patch as part of INET_MATCH
changes.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__ip_options_echo() uses the current network namespace, and
currently retrives it via skb->dst->dev.
This commit adds an explicit 'net' argument to __ip_options_echo()
and update all the call sites to provide it, usually via a simpler
sock_net().
After this change, __ip_options_echo() no more needs to access
skb->dst and we can drop a couple of hack to preserve such
info in the rx path.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. Don't get the metric RTAX_ADVMSS of dst.
There are two reasons.
1) Its caller dst_metric_advmss has already invoke dst_metric_advmss
before invoke default_advmss.
2) The ipv4_default_advmss is used to get the default mss, it should
not try to get the metric like ip6_default_advmss.
2. Use sizeof(tcphdr)+sizeof(iphdr) instead of literal 40.
3. Define one new macro IPV4_MAX_PMTU instead of 65535 according to
RFC 2675, section 5.1.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
that denotes the first unprivileged inet port in the namespace. To
disable all privileged ports set this to zero. It also checks for
overlap with the local port range. The privileged and local range may
not overlap.
The use case for this change is to allow containerized processes to bind
to priviliged ports, but prevent them from ever being allowed to modify
their container's network configuration. The latter is accomplished by
ensuring that the network namespace is not a child of the user
namespace. This modification was needed to allow the container manager
to disable a namespace's priviliged port restrictions without exposing
control of the network namespace to processes in the user namespace.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Use the UID in routing lookups made by protocol connect() and
sendmsg() functions.
- Make sure that routing lookups triggered by incoming packets
(e.g., Path MTU discovery) take the UID of the socket into
account.
- For packets not associated with a userspace socket, (e.g., ping
replies) use UID 0 inside the user namespace corresponding to
the network namespace the socket belongs to. This allows
all namespaces to apply routing and iptables rules to
kernel-originated traffic in that namespaces by matching UID 0.
This is better than using the UID of the kernel socket that is
sending the traffic, because the UID of kernel sockets created
at namespace creation time (e.g., the per-processor ICMP and
TCP sockets) is the UID of the user that created the socket,
which might not be mapped in the namespace.
Tested: compiles allnoconfig, allyesconfig, allmodconfig
Tested: https://android-review.googlesource.com/253302
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some configurations (e.g. geneve interface with default
MTU of 1500 over an ethernet interface with 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered to be a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.
Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
have shown no measurable performance impact for traffic not
requiring fragmentation.
Fixes: c7ba65d7b6 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
First bug was added in commit ad6f939ab1 ("ip: Add offset parameter to
ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
Then commit e6afc8ace6 ("udp: remove headers from UDP packets before
queueing") forgot to adjust the offsets now UDP headers are pulled
before skb are put in receive queue.
Fixes: ad6f939ab1 ("ip: Add offset parameter to ip_cmsg_recv")
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sam Kumar <samanthakumar@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, socket lookups for l3mdev (vrf) use cases can match a socket
that is bound to a port but not a device (ie., a global socket). If the
sysctl tcp_l3mdev_accept is not set this leads to ack packets going out
based on the main table even though the packet came in from an L3 domain.
The end result is that the connection does not establish creating
confusion for users since the service is running and a socket shows in
ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
skb came through an interface enslaved to an l3mdev device and the
tcp_l3mdev_accept is not set.
skb's through an l3mdev interface are marked by setting a flag in
inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
inet_skb_parm struct is moved in the cb per commit 971f10eca1, so the
match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
move is done after the socket lookup, so IP6CB is used.
The flags field in inet_skb_parm struct needs to be increased to add
another flag. There is currently a 1-byte hole following the flags,
so it can be expanded to u16 without increasing the size of the struct.
Fixes: 193125dbd8 ("net: Introduce VRF device driver")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is to introduce the generic interfaces for snmp_get_cpu_field{,64}.
It exchanges the two for-loops for collecting the percpu statistics data.
This can aggregate the data by going through all the items of each cpu
sequentially.
Signed-off-by: Jia He <hejianet@gmail.com>
Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This flag indicates whether fragmentation of segments is allowed.
Formerly this policy was hardcoded according to IPSKB_FORWARDED (set by
either ip_forward or ipmr_forward).
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
calls ip_sk_use_pmtu which casts sk as an inet_sk).
However, in the case of UDP tunneling, the skb->sk is not necessarily an
inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
tun/tap).
OTOH, the sk passed as an argument throughout IP stack's output path is
the one which is of PMTU interest:
- In case of local sockets, sk is same as skb->sk;
- In case of a udp tunnel, sk is the tunneling socket.
Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
This augments 7026b1ddb6 'netfilter: Pass socket pointer down through okfn().'
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Applications such as OSPF and BFD need the original ingress device not
the VRF device; the latter can be derived from the former. To that end
add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
the skb control buffer similar to IPv6. From there the pktinfo can just
pull it from cb with the PKTINFO_SKB_CB cast.
The previous patch moving the skb->dev change to L3 means nothing else
is needed for IPv6; it just works.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is nothing related to BH in SNMP counters anymore,
since linux-3.0.
Rename helpers to use __ prefix instead of _BH prefix,
for contexts where preemption is disabled.
This more closely matches convention used to update
percpu variables.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>