[ Upstream commit 9bc4ffd32ef8943f5c5a42c9637cfd04771d021b ]
psci_init_system_suspend() invokes suspend_set_ops() very early during
bootup even before kernel command line for mem_sleep_default is setup.
This leads to kernel command line mem_sleep_default=s2idle not working
as mem_sleep_current gets changed to deep via suspend_set_ops() and never
changes back to s2idle.
Set mem_sleep_current along with mem_sleep_default during kernel command
line setup as default suspend mode.
Fixes: faf7ec4a92 ("drivers: firmware: psci: add system suspend support")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Maulik Shah <quic_mkshah@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9b13df3fb64ee95e2397585404e442afee2c7d4f ]
The timer related functions do not have a strict timer_ prefixed namespace
which is really annoying.
Rename del_timer_sync() to timer_delete_sync() and provide del_timer_sync()
as a wrapper. Document that del_timer_sync() is not for new code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/r/20221123201624.954785441@linutronix.de
Stable-dep-of: 0f7352557a35 ("wifi: brcmfmac: Fix use-after-free bug in brcmf_cfg80211_detach")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 168f6b6ffbeec0b9333f3582e4cf637300858db5 ]
del_timer_sync() is assumed to be pointless on uniprocessor systems and can
be mapped to del_timer() because in theory del_timer() can never be invoked
while the timer callback function is executed.
This is not entirely true because del_timer() can be invoked from interrupt
context and therefore hit in the middle of a running timer callback.
Contrary to that del_timer_sync() is not allowed to be invoked from
interrupt context unless the affected timer is marked with TIMER_IRQSAFE.
del_timer_sync() has proper checks in place to detect such a situation.
Give up on the UP optimization and make del_timer_sync() unconditionally
available.
Co-developed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org
Link: https://lore.kernel.org/r/20221123201624.888306160@linutronix.de
Stable-dep-of: 0f7352557a35 ("wifi: brcmfmac: Fix use-after-free bug in brcmf_cfg80211_detach")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 030dcdd197d77374879bb5603d091eee7d8aba80 ]
When PREEMPT_RT is enabled, the soft interrupt thread can be preempted. If
the soft interrupt thread is preempted in the middle of a timer callback,
then calling del_timer_sync() can lead to two issues:
- If the caller is on a remote CPU then it has to spin wait for the timer
handler to complete. This can result in unbound priority inversion.
- If the caller originates from the task which preempted the timer
handler on the same CPU, then spin waiting for the timer handler to
complete is never going to end.
To avoid these issues, add a new lock to the timer base which is held
around the execution of the timer callbacks. If del_timer_sync() detects
that the timer callback is currently running, it blocks on the expiry
lock. When the callback is finished, the expiry lock is dropped by the
softirq thread which wakes up the waiter and the system makes progress.
This addresses both the priority inversion and the life lock issues.
This mechanism is not used for timers which are marked IRQSAFE as for those
preemption is disabled accross the callback and therefore this situation
cannot happen. The callbacks for such timers need to be individually
audited for RT compliance.
The same issue can happen in virtual machines when the vCPU which runs a
timer callback is scheduled out. If a second vCPU of the same guest calls
del_timer_sync() it will spin wait for the other vCPU to be scheduled back
in. The expiry lock mechanism would avoid that. It'd be trivial to enable
this when paravirt spinlocks are enabled in a guest, but it's not clear
whether this is an actual problem in the wild, so for now it's an RT only
mechanism.
As the softirq thread can be preempted with PREEMPT_RT=y, the SMP variant
of del_timer_sync() needs to be used on UP as well.
[ tglx: Refactored it for mainline ]
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.832418500@linutronix.de
Stable-dep-of: 0f7352557a35 ("wifi: brcmfmac: Fix use-after-free bug in brcmf_cfg80211_detach")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f28d3d5346e97e60c81f933ac89ccf015430e5cf ]
Timers are added to the timer wheel off by one. This is required in
case a timer is queued directly before incrementing jiffies to prevent
early timer expiry.
When reading a timer trace and relying only on the expiry time of the timer
in the timer_start trace point and on the now in the timer_expiry_entry
trace point, it seems that the timer fires late. With the current
timer_expiry_entry trace point information only now=jiffies is printed but
not the value of base->clk. This makes it impossible to draw a conclusion
to the index of base->clk and makes it impossible to examine timer problems
without additional trace points.
Therefore add the base->clk value to the timer_expire_entry trace
point, to be able to calculate the index the timer base is located at
during collecting expired timers.
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Cc: peterz@infradead.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20190321120921.16463-5-anna-maria@linutronix.de
Stable-dep-of: 0f7352557a35 ("wifi: brcmfmac: Fix use-after-free bug in brcmf_cfg80211_detach")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7a4b21250bf79eef26543d35bd390448646c536b ]
The stackmap code relies on roundup_pow_of_two() to compute the number
of hash buckets, and contains an overflow check by checking if the
resulting value is 0. However, on 32-bit arches, the roundup code itself
can overflow by doing a 32-bit left-shift of an unsigned long value,
which is undefined behaviour, so it is not guaranteed to truncate
neatly. This was triggered by syzbot on the DEVMAP_HASH type, which
contains the same check, copied from the hashtab code.
The commit in the fixes tag actually attempted to fix this, but the fix
did not account for the UB, so the fix only works on CPUs where an
overflow does result in a neat truncation to zero, which is not
guaranteed. Checking the value before rounding does not have this
problem.
Fixes: 6183f4d3a0a2 ("bpf: Check for integer overflow when using roundup_pow_of_two()")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Bui Quang Minh <minhquangbui99@gmail.com>
Message-ID: <20240307120340.99577-4-toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6787d916c2cf9850c97a0a3f73e08c43e7d973b1 ]
The hashtab code relies on roundup_pow_of_two() to compute the number of
hash buckets, and contains an overflow check by checking if the
resulting value is 0. However, on 32-bit arches, the roundup code itself
can overflow by doing a 32-bit left-shift of an unsigned long value,
which is undefined behaviour, so it is not guaranteed to truncate
neatly. This was triggered by syzbot on the DEVMAP_HASH type, which
contains the same check, copied from the hashtab code. So apply the same
fix to hashtab, by moving the overflow check to before the roundup.
Fixes: daaf427c6a ("bpf: fix arraymap NULL deref and missing overflow and zero size checks")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Message-ID: <20240307120340.99577-3-toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 14274d0bd31b4debf28284604589f596ad2e99f2 ]
So far, get_device_system_crosststamp() unconditionally passes
system_counterval.cycles to timekeeping_cycles_to_ns(). But when
interpolating system time (do_interp == true), system_counterval.cycles is
before tkr_mono.cycle_last, contrary to the timekeeping_cycles_to_ns()
expectations.
On x86, CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE will mitigate on
interpolating, setting delta to 0. With delta == 0, xtstamp->sys_monoraw
and xtstamp->sys_realtime are then set to the last update time, as
implicitly expected by adjust_historical_crosststamp(). On other
architectures, the resulting nonsense xtstamp->sys_monoraw and
xtstamp->sys_realtime corrupt the xtstamp (ts) adjustment in
adjust_historical_crosststamp().
Fix this by deriving xtstamp->sys_monoraw and xtstamp->sys_realtime from
the last update time when interpolating, by using the local variable
"cycles". The local variable already has the right value when
interpolating, unlike system_counterval.cycles.
Fixes: 2c756feb18 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20231218073849.35294-4-peter.hilber@opensynergy.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 87a41130881995f82f7adbafbfeddaebfb35f0ef ]
The cycle_between() helper checks if parameter test is in the open interval
(before, after). Colloquially speaking, this also applies to the counter
wrap-around special case before > after. get_device_system_crosststamp()
currently uses cycle_between() at the first call site to decide whether to
interpolate for older counter readings.
get_device_system_crosststamp() has the following problem with
cycle_between() testing against an open interval: Assume that, by chance,
cycles == tk->tkr_mono.cycle_last (in the following, "cycle_last" for
brevity). Then, cycle_between() at the first call site, with effective
argument values cycle_between(cycle_last, cycles, now), returns false,
enabling interpolation. During interpolation,
get_device_system_crosststamp() will then call cycle_between() at the
second call site (if a history_begin was supplied). The effective argument
values are cycle_between(history_begin->cycles, cycles, cycles), since
system_counterval.cycles == interval_start == cycles, per the assumption.
Due to the test against the open interval, cycle_between() returns false
again. This causes get_device_system_crosststamp() to return -EINVAL.
This failure should be avoided, since get_device_system_crosststamp() works
both when cycles follows cycle_last (no interpolation), and when cycles
precedes cycle_last (interpolation). For the case cycles == cycle_last,
interpolation is actually unneeded.
Fix this by changing cycle_between() into timestamp_in_interval(), which
now checks against the closed interval, rather than the open interval.
This changes the get_device_system_crosststamp() behavior for three corner
cases:
1. Bypass interpolation in the case cycles == tk->tkr_mono.cycle_last,
fixing the problem described above.
2. At the first timestamp_in_interval() call site, cycles == now no longer
causes failure.
3. At the second timestamp_in_interval() call site, history_begin->cycles
== system_counterval.cycles no longer causes failure.
adjust_historical_crosststamp() also works for this corner case,
where partial_history_cycles == total_history_cycles.
These behavioral changes should not cause any problems.
Fixes: 2c756feb18 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231218073849.35294-3-peter.hilber@opensynergy.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 84dccadd3e2a3f1a373826ad71e5ced5e76b0c00 ]
cycle_between() decides whether get_device_system_crosststamp() will
interpolate for older counter readings.
cycle_between() yields wrong results for a counter wrap-around where after
< before < test, and for the case after < test < before.
Fix the comparison logic.
Fixes: 2c756feb18 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20231218073849.35294-2-peter.hilber@opensynergy.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f7ec1cd5cc7ef3ad964b677ba82b8b77f1c93009 ]
lock_task_sighand() can trigger a hard lockup. If NR_CPUS threads call
getrusage() at the same time and the process has NR_THREADS, spin_lock_irq
will spin with irqs disabled O(NR_CPUS * NR_THREADS) time.
Change getrusage() to use sig->stats_lock, it was specifically designed
for this type of use. This way it runs lockless in the likely case.
TODO:
- Change do_task_stat() to use sig->stats_lock too, then we can
remove spin_lock_irq(siglock) in wait_task_zombie().
- Turn sig->stats_lock into seqcount_rwlock_t, this way the
readers in the slow mode won't exclude each other. See
https://lore.kernel.org/all/20230913154907.GA26210@redhat.com/
- stats_lock has to disable irqs because ->siglock can be taken
in irq context, it would be very nice to change __exit_signal()
to avoid the siglock->stats_lock dependency.
Link: https://lkml.kernel.org/r/20240122155053.GA26214@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dylan Hatch <dylanbhatch@google.com>
Tested-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 13b7bc60b5353371460a203df6c38ccd38ad7a3a ]
do/while_each_thread should be avoided when possible.
Plus this change allows to avoid lock_task_sighand(), we can use rcu
and/or sig->stats_lock instead.
Link: https://lkml.kernel.org/r/20230909172629.GA20454@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: f7ec1cd5cc7e ("getrusage: use sig->stats_lock rather than lock_task_sighand()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit daa694e4137571b4ebec330f9a9b4d54aa8b8089 ]
Patch series "getrusage: use sig->stats_lock", v2.
This patch (of 2):
thread_group_cputime() does its own locking, we can safely shift
thread_group_cputime_adjusted() which does another for_each_thread loop
outside of ->siglock protected section.
This is also preparation for the next patch which changes getrusage() to
use stats_lock instead of siglock, thread_group_cputime() takes the same
lock. With the current implementation recursive read_seqbegin_or_lock()
is fine, thread_group_cputime() can't enter the slow mode if the caller
holds stats_lock, yet this looks more safe and better performance-wise.
Link: https://lkml.kernel.org/r/20240122155023.GA26169@redhat.com
Link: https://lkml.kernel.org/r/20240122155050.GA26205@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dylan Hatch <dylanbhatch@google.com>
Tested-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bdd565f817a74b9e30edec108f7cb1dbc762b8a6 ]
There are two 'struct timeval' fields in 'struct rusage'.
Unfortunately the definition of timeval is now ambiguous when used in
user space with a libc that has a 64-bit time_t, and this also changes
the 'rusage' definition in user space in a way that is incompatible with
the system call interface.
While there is no good solution to avoid all ambiguity here, change
the definition in the kernel headers to be compatible with the kernel
ABI, using __kernel_old_timeval as an unambiguous base type.
In previous discussions, there was also a plan to add a replacement
for rusage based on 64-bit timestamps and nanosecond resolution,
i.e. 'struct __kernel_timespec'. I have patches for that as well,
if anyone thinks we should do that.
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Stable-dep-of: daa694e41375 ("getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
This reverts commit b2ea47a79d as it is
needed in Android systems. It still breaks the ABI, but that is no
longer an issue on this branch.
Bug: 317347552
Change-Id: I9c7f8901c5aabf6e0378b5218f587d6c7335522f
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 079be8fc630943d9fc70a97807feb73d169ee3fc ]
The validation of the value written to sched_rt_period_us was broken
because:
- the sysclt_sched_rt_period is declared as unsigned int
- parsed by proc_do_intvec()
- the range is asserted after the value parsed by proc_do_intvec()
Because of this negative values written to the file were written into a
unsigned integer that were later on interpreted as large positive
integers which did passed the check:
if (sysclt_sched_rt_period <= 0)
return EINVAL;
This commit fixes the parsing by setting explicit range for both
perid_us and runtime_us into the sched_rt_sysctls table and processes
the values with proc_dointvec_minmax() instead.
Alternatively if we wanted to use full range of unsigned int for the
period value we would have to split the proc_handler and use
proc_douintvec() for it however even the
Documentation/scheduller/sched-rt-group.rst describes the range as 1 to
INT_MAX.
As far as I can tell the only problem this causes is that the sysctl
file allows writing negative values which when read back may confuse
userspace.
There is also a LTP test being submitted for these sysctl files at:
http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz
[ pvorel: rebased for 4.19 ]
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c1fc6484e1fb7cc2481d169bfef129a1b0676abe ]
The sched_rr_timeslice can be reset to default by writing value that is
<= 0. However after reading from this file we always got the last value
written, which is not useful at all.
$ echo -1 > /proc/sys/kernel/sched_rr_timeslice_ms
$ cat /proc/sys/kernel/sched_rr_timeslice_ms
-1
Fix this by setting the variable that holds the sysctl file value to the
jiffies_to_msecs(RR_TIMESLICE) in case that <= 0 value was written.
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Petr Vorel <pvorel@suse.cz>
Link: https://lore.kernel.org/r/20230802151906.25258-3-chrubis@suse.cz
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c7fcb99877f9f542c918509b2801065adcaf46fa ]
There is a 10% rounding error in the intial value of the
sysctl_sched_rr_timeslice with CONFIG_HZ_300=y.
This was found with LTP test sched_rr_get_interval01:
sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90
sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90
What this test does is to compare the return value from the
sched_rr_get_interval() and the sched_rr_timeslice_ms sysctl file and
fails if they do not match.
The problem it found is the intial sysctl file value which was computed as:
static int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
which works fine as long as MSEC_PER_SEC is multiple of HZ, however it
introduces 10% rounding error for CONFIG_HZ_300:
(MSEC_PER_SEC / HZ) * (100 * HZ / 1000)
(1000 / 300) * (100 * 300 / 1000)
3 * 30 = 90
This can be easily fixed by reversing the order of the multiplication
and division. After this fix we get:
(MSEC_PER_SEC * (100 * HZ / 1000)) / HZ
(1000 * (100 * 300 / 1000)) / 300
(1000 * 30) / 300 = 100
Fixes: 975e155ed8 ("sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds")
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Petr Vorel <pvorel@suse.cz>
Link: https://lore.kernel.org/r/20230802151906.25258-2-chrubis@suse.cz
[ pvorel: rebased for 4.19 ]
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 944d5fe50f3f03daacfea16300e656a1691c4a23 upstream.
On some systems, sys_membarrier can be very expensive, causing overall
slowdowns for everything. So put a lock on the path in order to
serialize the accesses to prevent the ability for this to be called at
too high of a frequency and saturate the machine.
Reviewed-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Fixes: 22e4ebb975 ("membarrier: Provide expedited private command")
Fixes: c5f58bd58f ("membarrier: Provide GLOBAL_EXPEDITED command")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ converted to explicit mutex_*() calls - cleanup.h is not in this stable
branch - gregkh ]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 66bbea9ed6446b8471d365a22734dc00556c4785 upstream.
The return type for ring_buffer_poll_wait() is __poll_t. This is behind
the scenes an unsigned where we can set event bits. In case of a
non-allocated CPU, we do return instead -EINVAL (0xffffffea). Lucky us,
this ends up setting few error bits (EPOLLERR | EPOLLHUP | EPOLLNVAL), so
user-space at least is aware something went wrong.
Nonetheless, this is an incorrect code. Replace that -EINVAL with a
proper EPOLLERR to clean that output. As this doesn't change the
behaviour, there's no need to treat this change as a bug fix.
Link: https://lore.kernel.org/linux-trace-kernel/20240131140955.3322792-1-vdonnefort@google.com
Cc: stable@vger.kernel.org
Fixes: 6721cb6002 ("ring-buffer: Do not poll non allocated cpu buffers")
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dad6a09f3148257ac1773cd90934d721d68ab595 upstream.
The hrtimers migration on CPU-down hotplug process has been moved
earlier, before the CPU actually goes to die. This leaves a small window
of opportunity to queue an hrtimer in a blind spot, leaving it ignored.
For example a practical case has been reported with RCU waking up a
SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, queuing that
way a sched/rt timer to the local offline CPU.
Make sure such situations never go unnoticed and warn when that happens.
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240129235646.3171983-4-boqun.feng@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 20c20bd11a0702ce4dc9300c3da58acf551d9725 ]
map is the pointer of outer map, and need_defer needs some explanation.
need_defer tells the implementation to defer the reference release of
the passed element and ensure that the element is still alive before
the bpf program, which may manipulate it, exits.
The following three cases will invoke map_fd_put_ptr() and different
need_defer values will be passed to these callers:
1) release the reference of the old element in the map during map update
or map deletion. The release must be deferred, otherwise the bpf
program may incur use-after-free problem, so need_defer needs to be
true.
2) release the reference of the to-be-added element in the error path of
map update. The to-be-added element is not visible to any bpf
program, so it is OK to pass false for need_defer parameter.
3) release the references of all elements in the map during map release.
Any bpf program which has access to the map must have been exited and
released, so need_defer=false will be OK.
These two parameters will be used by the following patches to fix the
potential use-after-free problem for map-in-map.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 022732e3d846e197539712e51ecada90ded0572a ]
When auditd_set sets the auditd_conn pointer, audit messages can
immediately be put on the socket by other kernel threads. If the backlog
is large or the rate is high, this can immediately fill the socket
buffer. If the audit daemon requested an ACK for this operation, a full
socket buffer causes the ACK to get dropped, also setting ENOBUFS on the
socket.
To avoid this race and ensure ACKs get through, fast-track the ACK in
this specific case to ensure it is sent before auditd_conn is set.
Signed-off-by: Chris Riches <chris.riches@nutanix.com>
[PM: fix some tab vs space damage]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 9a574ea9069be30b835a3da772c039993c43369b upstream.
Commit 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs
CPU hotplug") preserved total idle sleep time and iowait sleeptime across
CPU hotplug events.
Similar reasoning applies to the number of idle calls and idle sleeps to
get the proper average of sleep time per idle invocation.
Preserve those fields too.
Fixes: 71fee48f ("tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug")
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122233534.3094238-1-tim.c.chen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2b44760609e9eaafc9d234a6883d042fc21132a7 ]
Running the following two commands in parallel on a multi-processor
AArch64 machine can sporadically produce an unexpected warning about
duplicate histogram entries:
$ while true; do
echo hist:key=id.syscall:val=hitcount > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
sleep 0.001
done
$ stress-ng --sysbadaddr $(nproc)
The warning looks as follows:
[ 2911.172474] ------------[ cut here ]------------
[ 2911.173111] Duplicates detected: 1
[ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408
[ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E)
[ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1
[ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G E 6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01
[ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018
[ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408
[ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408
[ 2911.185310] sp : ffff8000a1513900
[ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001
[ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008
[ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180
[ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff
[ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8
[ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731
[ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c
[ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8
[ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000
[ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480
[ 2911.194259] Call trace:
[ 2911.194626] tracing_map_sort_entries+0x3e0/0x408
[ 2911.195220] hist_show+0x124/0x800
[ 2911.195692] seq_read_iter+0x1d4/0x4e8
[ 2911.196193] seq_read+0xe8/0x138
[ 2911.196638] vfs_read+0xc8/0x300
[ 2911.197078] ksys_read+0x70/0x108
[ 2911.197534] __arm64_sys_read+0x24/0x38
[ 2911.198046] invoke_syscall+0x78/0x108
[ 2911.198553] el0_svc_common.constprop.0+0xd0/0xf8
[ 2911.199157] do_el0_svc+0x28/0x40
[ 2911.199613] el0_svc+0x40/0x178
[ 2911.200048] el0t_64_sync_handler+0x13c/0x158
[ 2911.200621] el0t_64_sync+0x1a8/0x1b0
[ 2911.201115] ---[ end trace 0000000000000000 ]---
The problem appears to be caused by CPU reordering of writes issued from
__tracing_map_insert().
The check for the presence of an element with a given key in this
function is:
val = READ_ONCE(entry->val);
if (val && keys_match(key, val->key, map->key_size)) ...
The write of a new entry is:
elt = get_free_elt(map);
memcpy(elt->key, key, map->key_size);
entry->val = elt;
The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;"
stores may become visible in the reversed order on another CPU. This
second CPU might then incorrectly determine that a new key doesn't match
an already present val->key and subsequently insert a new element,
resulting in a duplicate.
Fix the problem by adding a write barrier between
"memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for
good measure, also use WRITE_ONCE(entry->val, elt) for publishing the
element. The sequence pairs with the mentioned "READ_ONCE(entry->val);"
and the "val->key" check which has an address dependency.
The barrier is placed on a path executed when adding an element for
a new key. Subsequent updates targeting the same key remain unaffected.
From the user's perspective, the issue was introduced by commit
c193707dde ("tracing: Remove code which merges duplicates"), which
followed commit cbf4100efb ("tracing: Add support to detect and avoid
duplicates"). The previous code operated differently; it inherently
expected potential races which result in duplicates but merged them
later when they occurred.
Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com
Fixes: c193707dde ("tracing: Remove code which merges duplicates")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 71cd7e80cfde548959952eac7063aeaea1f2e1c6 upstream.
An S4 (suspend to disk) test on the LoongArch 3A6000 platform sometimes
fails with the following error messaged in the dmesg log:
Invalid LZO compressed length
That happens because when compressing/decompressing the image, the
synchronization between the control thread and the compress/decompress/crc
thread is based on a relaxed ordering interface, which is unreliable, and the
following situation may occur:
CPU 0 CPU 1
save_image_lzo lzo_compress_threadfn
atomic_set(&d->stop, 1);
atomic_read(&data[thr].stop)
data[thr].cmp = data[thr].cmp_len;
WRITE data[thr].cmp_len
Then CPU0 gets a stale cmp_len and writes it to disk. During resume from S4,
wrong cmp_len is loaded.
To maintain data consistency between the two threads, use the acquire/release
variants of atomic set and read operations.
Fixes: 081a9d043c ("PM / Hibernate: Improve performance of LZO/plain hibernation, checksum image")
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Hongchen Zhang <zhanghongchen@loongson.cn>
Co-developed-by: Weihao Li <liweihao@loongson.cn>
Signed-off-by: Weihao Li <liweihao@loongson.cn>
[ rjw: Subject rewrite and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
https://source.android.com/docs/security/bulletin/2024-02-01
* tag 'ASB-2024-02-05_4.19-stable' of https://android.googlesource.com/kernel/common:
Reapply "perf: Fix perf_event_validate_size()"
UPSTREAM: usb: raw-gadget: properly handle interrupted requests
UPSTREAM: mm/cma: use nth_page() in place of direct struct page manipulation
UPSTREAM: wireguard: allowedips: expand maximum node depth
UPSTREAM: coresight: tmc: Explicit type conversions to prevent integer overflow
UPSTREAM: wireguard: netlink: send staged packets when setting initial private key
UPSTREAM: wireguard: queueing: use saner cpu selection wrapping
UPSTREAM: kheaders: Use array declaration instead of char
UPSTREAM: arm64: efi: Make efi_rt_lock a raw_spinlock
UPSTREAM: sched/psi: Fix use-after-free in ep_remove_wait_queue()
UPSTREAM: usb: musb: mediatek: don't unregister something that wasn't registered
UPSTREAM: xfrm/compat: prevent potential spectre v1 gadget in xfrm_xlate32_attr()
UPSTREAM: xfrm: compat: change expression for switch in xfrm_xlate64
UPSTREAM: perf/core: Call LSM hook after copying perf_event_attr
Linux 4.19.306
crypto: scompress - initialize per-CPU variables on each CPU
Revert "NFSD: Fix possible sleep during nfsd4_release_lockowner()"
i2c: s3c24xx: fix transferring more than one message in polling mode
i2c: s3c24xx: fix read transfers in polling mode
kdb: Fix a potential buffer overflow in kdb_local()
kdb: Censor attempts to set PROMPT without ENABLE_MEM_READ
ipvs: avoid stat macros calls from preemptible context
net: dsa: vsc73xx: Add null pointer check to vsc73xx_gpio_probe
net: ravb: Fix dma_addr_t truncation in error case
net: qualcomm: rmnet: fix global oob in rmnet_policy
serial: imx: Correct clock error message in function probe()
apparmor: avoid crash when parsed profile name is empty
perf genelf: Set ELF program header addresses properly
acpi: property: Let args be NULL in __acpi_node_get_property_reference
MIPS: Alchemy: Fix an out-of-bound access in db1550_dev_setup()
MIPS: Alchemy: Fix an out-of-bound access in db1200_dev_setup()
HID: wacom: Correct behavior when processing some confidence == false touches
wifi: mwifiex: configure BSSID consistently when starting AP
wifi: rtlwifi: Convert LNKCTL change to PCIe cap RMW accessors
wifi: rtlwifi: Remove bogus and dangerous ASPM disable/enable code
fbdev: flush deferred work in fb_deferred_io_fsync()
ALSA: oxygen: Fix right channel of capture volume mixer
usb: mon: Fix atomicity violation in mon_bin_vma_fault
usb: typec: class: fix typec_altmode_put_partner to put plugs
Revert "usb: typec: class: fix typec_altmode_put_partner to put plugs"
usb: chipidea: wait controller resume finished for wakeup irq
Revert "usb: dwc3: don't reset device side if dwc3 was configured as host-only"
Revert "usb: dwc3: Soft reset phy on probe for host"
usb: dwc: ep0: Update request status in dwc3_ep0_stall_restart
usb: phy: mxs: remove CONFIG_USB_OTG condition for mxs_phy_is_otg_host()
tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
binder: fix unused alloc->free_async_space
binder: fix race between mmput() and do_exit()
xen-netback: don't produce zero-size SKB frags
Revert "ASoC: atmel: Remove system clock tree configuration for at91sam9g20ek"
Input: atkbd - use ab83 as id when skipping the getid command
binder: fix async space check for 0-sized buffers
of: unittest: Fix of_count_phandle_with_args() expected value message
of: Fix double free in of_parse_phandle_with_args_map
mmc: sdhci_omap: Fix TI SoC dependencies
watchdog: bcm2835_wdt: Fix WDIOC_SETTIMEOUT handling
watchdog/hpwdt: Only claim UNKNOWN NMI if from iLO
watchdog: set cdev owner before adding
gpu/drm/radeon: fix two memleaks in radeon_vm_init
drivers/amd/pm: fix a use-after-free in kv_parse_power_table
drm/amd/pm: fix a double-free in si_dpm_init
drm/amdgpu/debugfs: fix error code when smc register accessors are NULL
media: dvbdev: drop refcount on error path in dvb_device_open()
media: cx231xx: fix a memleak in cx231xx_init_isoc
drm/radeon/trinity_dpm: fix a memleak in trinity_parse_power_table
drm/radeon/dpm: fix a memleak in sumo_parse_power_table
drm/radeon: check the alloc_workqueue return value in radeon_crtc_init()
drm/drv: propagate errors from drm_modeset_register_all()
drm/msm/mdp4: flush vblank event on disable
ASoC: cs35l34: Fix GPIO name and drop legacy include
ASoC: cs35l33: Fix GPIO name and drop legacy include
drm/radeon: check return value of radeon_ring_lock()
drm/radeon/r100: Fix integer overflow issues in r100_cs_track_check()
drm/radeon/r600_cs: Fix possible int overflows in r600_cs_check_reg()
f2fs: fix to avoid dirent corruption
drm/bridge: Fix typo in post_disable() description
media: pvrusb2: fix use after free on context disconnection
RDMA/usnic: Silence uninitialized symbol smatch warnings
ip6_tunnel: fix NEXTHDR_FRAGMENT handling in ip6_tnl_parse_tlv_enc_lim()
Bluetooth: btmtkuart: fix recv_buf() return value
Bluetooth: Fix bogus check for re-auth no supported with non-ssp
wifi: rtlwifi: rtl8192se: using calculate_bit_shift()
wifi: rtlwifi: rtl8192ee: using calculate_bit_shift()
wifi: rtlwifi: rtl8192de: using calculate_bit_shift()
rtlwifi: rtl8192de: make arrays static const, makes object smaller
wifi: rtlwifi: rtl8192ce: using calculate_bit_shift()
wifi: rtlwifi: rtl8192cu: using calculate_bit_shift()
wifi: rtlwifi: rtl8192c: using calculate_bit_shift()
wifi: rtlwifi: rtl8188ee: phy: using calculate_bit_shift()
wifi: rtlwifi: add calculate_bit_shift()
dma-mapping: clear dev->dma_mem to NULL after freeing it
scsi: hisi_sas: Replace with standard error code return value
wifi: rtlwifi: rtl8821ae: phy: fix an undefined bitwise shift behavior
rtlwifi: Use ffs in <foo>_phy_calculate_bit_shift
firmware: ti_sci: Fix an off-by-one in ti_sci_debugfs_create()
net/ncsi: Fix netlink major/minor version numbers
ncsi: internal.h: Fix a spello
ARM: dts: qcom: apq8064: correct XOADC register address
wifi: libertas: stop selecting wext
bpf, lpm: Fix check prefixlen before walking trie
NFSv4.1/pnfs: Ensure we handle the error NFS4ERR_RETURNCONFLICT
blocklayoutdriver: Fix reference leak of pnfs_device_node
crypto: scomp - fix req->dst buffer overflow
crypto: scompress - Use per-CPU struct instead multiple variables
crypto: scompress - return proper error code for allocation failure
crypto: sahara - do not resize req->src when doing hash operations
crypto: sahara - fix processing hash requests with req->nbytes < sg->length
crypto: sahara - improve error handling in sahara_sha_process()
crypto: sahara - fix wait_for_completion_timeout() error handling
crypto: sahara - fix ahash reqsize
crypto: virtio - Wait for tasklet to complete on device remove
pstore: ram_core: fix possible overflow in persistent_ram_init_ecc()
crypto: sahara - fix error handling in sahara_hw_descriptor_create()
crypto: sahara - fix processing requests with cryptlen < sg->length
crypto: sahara - fix ahash selftest failure
crypto: sahara - remove FLAGS_NEW_KEY logic
crypto: af_alg - Disallow multiple in-flight AIO requests
crypto: ccp - fix memleak in ccp_init_dm_workarea
crypto: virtio - Handle dataq logic with tasklet
selinux: Fix error priority for bind with AF_UNSPEC on PF_INET6 socket
mtd: Fix gluebi NULL pointer dereference caused by ftl notifier
calipso: fix memory leak in netlbl_calipso_add_pass()
netlabel: remove unused parameter in netlbl_netlink_auditinfo()
net: netlabel: Fix kerneldoc warnings
ACPI: LPIT: Avoid u32 multiplication overflow
ACPI: video: check for error while searching for backlight device parent
mtd: rawnand: Increment IFC_TIMEOUT_MSECS for nand controller response
powerpc/imc-pmu: Add a null pointer check in update_events_in_group()
powerpc/powernv: Add a null pointer check in opal_event_init()
selftests/powerpc: Fix error handling in FPU/VMX preemption tests
powerpc/pseries/memhp: Fix access beyond end of drmem array
powerpc/pseries/memhotplug: Quieten some DLPAR operations
powerpc/44x: select I2C for CURRITUCK
powerpc: remove redundant 'default n' from Kconfig-s
powerpc: add crtsavres.o to always-y instead of extra-y
EDAC/thunderx: Fix possible out-of-bounds string access
x86/lib: Fix overflow when counting digits
coresight: etm4x: Fix width of CCITMIN field
uio: Fix use-after-free in uio_open
binder: fix comment on binder_alloc_new_buf() return value
binder: use EPOLLERR from eventpoll.h
drm/crtc: fix uninitialized variable use
ARM: sun9i: smp: fix return code check of of_property_match_string
Input: xpad - add Razer Wolverine V2 support
ARC: fix spare error
s390/scm: fix virtual vs physical address confusion
Input: i8042 - add nomux quirk for Acer P459-G2-M
Input: atkbd - skip ATKBD_CMD_GETID in translated mode
reset: hisilicon: hi6220: fix Wvoid-pointer-to-enum-cast warning
ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI
tracing: Add size check when printing trace_marker output
tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing
drm/crtc: Fix uninit-value bug in drm_mode_setcrtc
jbd2: correct the printing of write_flags in jbd2_write_superblock()
clk: rockchip: rk3128: Fix HCLK_OTG gate register
drm/exynos: fix a potential error pointer dereference
ASoC: da7219: Support low DC impedance headset
net/tg3: fix race condition in tg3_reset_task()
ASoC: rt5650: add mutex to avoid the jack detection failure
ASoC: cs43130: Fix incorrect frame delay configuration
ASoC: cs43130: Fix the position of const qualifier
ASoC: Intel: Skylake: mem leak in skl register function
f2fs: explicitly null-terminate the xattr list
UPSTREAM: wifi: cfg80211: fix buffer overflow in elem comparison
UPSTREAM: gcov: clang: fix the buffer overflow issue
BACKPORT: selinux: enable use of both GFP_KERNEL and GFP_ATOMIC in convert_context()
UPSTREAM: wifi: cfg80211: avoid nontransmitted BSS list corruption
UPSTREAM: wifi: cfg80211: fix BSS refcounting bugs
UPSTREAM: wifi: cfg80211: ensure length byte is present before access
UPSTREAM: wifi: cfg80211: fix u8 overflow in cfg80211_update_notlisted_nontrans()
UPSTREAM: wireguard: netlink: avoid variable-sized memcpy on sockaddr
UPSTREAM: wireguard: ratelimiter: disable timings test by default
UPSTREAM: crypto: lib - remove unneeded selection of XOR_BLOCKS
UPSTREAM: wireguard: allowedips: don't corrupt stack when detecting overflow
UPSTREAM: wireguard: ratelimiter: use hrtimer in selftest
UPSTREAM: crypto: arm64/poly1305 - fix a read out-of-bound
UPSTREAM: wifi: mac80211_hwsim: set virtio device ready in probe()
UPSTREAM: crypto: memneq - move into lib/
UPSTREAM: dma-buf: fix use of DMA_BUF_SET_NAME_{A,B} in userspace
BACKPORT: usb: typec: mux: Check dev_set_name() return value
UPSTREAM: wireguard: device: check for metadata_dst with skb_valid_dst()
UPSTREAM: sched/fair: Fix cfs_rq_clock_pelt() for throttled cfs_rq
UPSTREAM: cfg80211: hold bss_lock while updating nontrans_list
UPSTREAM: wireguard: socket: ignore v6 endpoints when ipv6 is disabled
UPSTREAM: wireguard: socket: free skb in send6 when ipv6 is disabled
UPSTREAM: wireguard: queueing: use CFI-safe ptr_ring cleanup function
UPSTREAM: mm: don't try to NUMA-migrate COW pages that have other uses
UPSTREAM: copy_process(): Move fd_install() out of sighand->siglock critical section
UPSTREAM: usb: raw-gadget: fix handling of dual-direction-capable endpoints
UPSTREAM: psi: Fix "no previous prototype" warnings when CONFIG_CGROUPS=n
UPSTREAM: sched/uclamp: Fix rq->uclamp_max not set on first enqueue
UPSTREAM: wireguard: selftests: increase default dmesg log size
UPSTREAM: wireguard: allowedips: add missing __rcu annotation to satisfy sparse
UPSTREAM: sched/uclamp: Fix uclamp_tg_restrict()
UPSTREAM: coresight: etm4x: Fix etm4_count race by moving cpuhp callbacks to init
UPSTREAM: sched/uclamp: Fix a deadlock when enabling uclamp static key
UPSTREAM: mac80211_hwsim: use GFP_ATOMIC under spin lock
UPSTREAM: usercopy: Avoid soft lockups in test_check_nonzero_user()
UPSTREAM: lib: test_user_copy: style cleanup
UPSTREAM: fork: return proper negative error code
Revert "ipv6: make ip6_rt_gc_expire an atomic_t"
Revert "ipv6: remove max_size check inline with ipv4"
Linux 4.19.305
ipv6: remove max_size check inline with ipv4
ipv6: make ip6_rt_gc_expire an atomic_t
net/dst: use a smaller percpu_counter batch for dst entries accounting
net: add a route cache full diagnostic message
PCI: Disable ATS for specific Intel IPU E2000 devices
PCI: Extract ATS disabling to a helper function
netfilter: nf_tables: Reject tables of unsupported family
fuse: nlookup missing decrement in fuse_direntplus_link
mmc: core: Cancel delayed work before releasing host
mmc: rpmb: fixes pause retune on all RPMB partitions.
mm: fix unmap_mapping_range high bits shift bug
firewire: ohci: suppress unexpected system reboot in AMD Ryzen machines and ASM108x/VT630x PCIe cards
mm/memory-failure: check the mapcount of the precise page
bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters()
asix: Add check for usbnet_get_endpoints
net/qla3xxx: fix potential memleak in ql_alloc_buffer_queues
net/qla3xxx: switch from 'pci_' to 'dma_' API
i40e: Restore VF MSI-X state during PCI reset
i40e: fix use-after-free in i40e_aqc_add_filters()
net: Save and restore msg_namelen in sock_sendmsg
net: bcmgenet: Fix FCS generation for fragmented skbuffs
ARM: sun9i: smp: Fix array-index-out-of-bounds read in sunxi_mc_smp_init
net: sched: em_text: fix possible memory leak in em_text_destroy()
i40e: Fix filter input checks to prevent config with invalid values
nfc: llcp_core: Hold a ref to llcp_local->dev when holding a ref to llcp_local
UPSTREAM: fsverity: skip PKCS#7 parser when keyring is empty
Conflicts:
drivers/hwtracing/coresight/coresight-etm4x.c
drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
include/linux/psi.h
mm/memory-failure.c
net/wireless/scan.c
Change-Id: I49b769cb8387e5d5f28730d13cbdf4ffd335dc70
Under CONFIG_FORTIFY_SOURCE, memcpy() will check the size of destination
and source buffers. Defining kernel_headers_data as "char" would trip
this check. Since these addresses are treated as byte arrays, define
them as arrays (as done everywhere else).
This was seen with:
$ cat /sys/kernel/kheaders.tar.xz >> /dev/null
detected buffer overflow in memcpy
kernel BUG at lib/string_helpers.c:1027!
...
RIP: 0010:fortify_panic+0xf/0x20
[...]
Call Trace:
<TASK>
ikheaders_read+0x45/0x50 [kheaders]
kernfs_fop_read_iter+0x1a4/0x2f0
...
Bug: 254441685
Reported-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20230302112130.6e402a98@kernel.org/
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 43d8ce9d65a5 ("Provide in-kernel headers to make extending kernel easier")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230302224946.never.243-kees@kernel.org
(cherry picked from commit b69edab47f1da8edd8e7bfdf8c70f51a2a5d89fb)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I73c7530b9c558c1c8dac5f8962dbc31c553c0be7
If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed in the following path:
do_rmdir
cgroup_rmdir
kernfs_drain_open_files
cgroup_file_release
cgroup_pressure_release
psi_trigger_destroy
However, the polling thread still has a reference to the pressure file and
will access the freed waitqueue when the file is closed or upon exit:
fput
ep_eventpoll_release
ep_free
ep_remove_wait_queue
remove_wait_queue
This results in use-after-free as pasted below.
The fundamental problem here is that cgroup_file_release() (and
consequently waitqueue's lifetime) is not tied to the file's real lifetime.
Using wake_up_pollfree() here might be less than ideal, but it is in line
with the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()")
since the waitqueue's lifetime is not tied to file's one and can be
considered as another special case. While this would be fixable by somehow
making cgroup_file_release() be tied to the fput(), it would require
sizable refactoring at cgroups or higher layer which might be more
justifiable if we identify more cases like this.
BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
Write of size 4 at addr ffff88810e625328 by task a.out/4404
CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
Call Trace:
<TASK>
dump_stack_lvl+0x73/0xa0
print_report+0x16c/0x4e0
kasan_report+0xc3/0xf0
kasan_check_range+0x2d2/0x310
_raw_spin_lock_irqsave+0x60/0xc0
remove_wait_queue+0x1a/0xa0
ep_free+0x12c/0x170
ep_eventpoll_release+0x26/0x30
__fput+0x202/0x400
task_work_run+0x11d/0x170
do_exit+0x495/0x1130
do_group_exit+0x100/0x100
get_signal+0xd67/0xde0
arch_do_signal_or_restart+0x2a/0x2b0
exit_to_user_mode_prepare+0x94/0x100
syscall_exit_to_user_mode+0x20/0x40
do_syscall_64+0x52/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
</TASK>
Allocated by task 4404:
kasan_set_track+0x3d/0x60
__kasan_kmalloc+0x85/0x90
psi_trigger_create+0x113/0x3e0
pressure_write+0x146/0x2e0
cgroup_file_write+0x11c/0x250
kernfs_fop_write_iter+0x186/0x220
vfs_write+0x3d8/0x5c0
ksys_write+0x90/0x110
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 4407:
kasan_set_track+0x3d/0x60
kasan_save_free_info+0x27/0x40
____kasan_slab_free+0x11d/0x170
slab_free_freelist_hook+0x87/0x150
__kmem_cache_free+0xcb/0x180
psi_trigger_destroy+0x2e8/0x310
cgroup_file_release+0x4f/0xb0
kernfs_drain_open_files+0x165/0x1f0
kernfs_drain+0x162/0x1a0
__kernfs_remove+0x1fb/0x310
kernfs_remove_by_name_ns+0x95/0xe0
cgroup_addrm_files+0x67f/0x700
cgroup_destroy_locked+0x283/0x3c0
cgroup_rmdir+0x29/0x100
kernfs_iop_rmdir+0xd1/0x140
vfs_rmdir+0xfe/0x240
do_rmdir+0x13d/0x280
__x64_sys_rmdir+0x2c/0x30
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Bug: 254441685
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
Link: https://lore.kernel.org/r/20230214212705.4058045-1-kamatam@amazon.com
(cherry picked from commit c2dbe32d5db5c4ead121cf86dabd5ab691fb47fe)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I9677499b2885149a1070f508931113ad8a02277a
Changes in 4.19.306
f2fs: explicitly null-terminate the xattr list
ASoC: Intel: Skylake: mem leak in skl register function
ASoC: cs43130: Fix the position of const qualifier
ASoC: cs43130: Fix incorrect frame delay configuration
ASoC: rt5650: add mutex to avoid the jack detection failure
net/tg3: fix race condition in tg3_reset_task()
ASoC: da7219: Support low DC impedance headset
drm/exynos: fix a potential error pointer dereference
clk: rockchip: rk3128: Fix HCLK_OTG gate register
jbd2: correct the printing of write_flags in jbd2_write_superblock()
drm/crtc: Fix uninit-value bug in drm_mode_setcrtc
tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing
tracing: Add size check when printing trace_marker output
ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI
reset: hisilicon: hi6220: fix Wvoid-pointer-to-enum-cast warning
Input: atkbd - skip ATKBD_CMD_GETID in translated mode
Input: i8042 - add nomux quirk for Acer P459-G2-M
s390/scm: fix virtual vs physical address confusion
ARC: fix spare error
Input: xpad - add Razer Wolverine V2 support
ARM: sun9i: smp: fix return code check of of_property_match_string
drm/crtc: fix uninitialized variable use
binder: use EPOLLERR from eventpoll.h
binder: fix comment on binder_alloc_new_buf() return value
uio: Fix use-after-free in uio_open
coresight: etm4x: Fix width of CCITMIN field
x86/lib: Fix overflow when counting digits
EDAC/thunderx: Fix possible out-of-bounds string access
powerpc: add crtsavres.o to always-y instead of extra-y
powerpc: remove redundant 'default n' from Kconfig-s
powerpc/44x: select I2C for CURRITUCK
powerpc/pseries/memhotplug: Quieten some DLPAR operations
powerpc/pseries/memhp: Fix access beyond end of drmem array
selftests/powerpc: Fix error handling in FPU/VMX preemption tests
powerpc/powernv: Add a null pointer check in opal_event_init()
powerpc/imc-pmu: Add a null pointer check in update_events_in_group()
mtd: rawnand: Increment IFC_TIMEOUT_MSECS for nand controller response
ACPI: video: check for error while searching for backlight device parent
ACPI: LPIT: Avoid u32 multiplication overflow
net: netlabel: Fix kerneldoc warnings
netlabel: remove unused parameter in netlbl_netlink_auditinfo()
calipso: fix memory leak in netlbl_calipso_add_pass()
mtd: Fix gluebi NULL pointer dereference caused by ftl notifier
selinux: Fix error priority for bind with AF_UNSPEC on PF_INET6 socket
crypto: virtio - Handle dataq logic with tasklet
crypto: ccp - fix memleak in ccp_init_dm_workarea
crypto: af_alg - Disallow multiple in-flight AIO requests
crypto: sahara - remove FLAGS_NEW_KEY logic
crypto: sahara - fix ahash selftest failure
crypto: sahara - fix processing requests with cryptlen < sg->length
crypto: sahara - fix error handling in sahara_hw_descriptor_create()
pstore: ram_core: fix possible overflow in persistent_ram_init_ecc()
crypto: virtio - Wait for tasklet to complete on device remove
crypto: sahara - fix ahash reqsize
crypto: sahara - fix wait_for_completion_timeout() error handling
crypto: sahara - improve error handling in sahara_sha_process()
crypto: sahara - fix processing hash requests with req->nbytes < sg->length
crypto: sahara - do not resize req->src when doing hash operations
crypto: scompress - return proper error code for allocation failure
crypto: scompress - Use per-CPU struct instead multiple variables
crypto: scomp - fix req->dst buffer overflow
blocklayoutdriver: Fix reference leak of pnfs_device_node
NFSv4.1/pnfs: Ensure we handle the error NFS4ERR_RETURNCONFLICT
bpf, lpm: Fix check prefixlen before walking trie
wifi: libertas: stop selecting wext
ARM: dts: qcom: apq8064: correct XOADC register address
ncsi: internal.h: Fix a spello
net/ncsi: Fix netlink major/minor version numbers
firmware: ti_sci: Fix an off-by-one in ti_sci_debugfs_create()
rtlwifi: Use ffs in <foo>_phy_calculate_bit_shift
wifi: rtlwifi: rtl8821ae: phy: fix an undefined bitwise shift behavior
scsi: hisi_sas: Replace with standard error code return value
dma-mapping: clear dev->dma_mem to NULL after freeing it
wifi: rtlwifi: add calculate_bit_shift()
wifi: rtlwifi: rtl8188ee: phy: using calculate_bit_shift()
wifi: rtlwifi: rtl8192c: using calculate_bit_shift()
wifi: rtlwifi: rtl8192cu: using calculate_bit_shift()
wifi: rtlwifi: rtl8192ce: using calculate_bit_shift()
rtlwifi: rtl8192de: make arrays static const, makes object smaller
wifi: rtlwifi: rtl8192de: using calculate_bit_shift()
wifi: rtlwifi: rtl8192ee: using calculate_bit_shift()
wifi: rtlwifi: rtl8192se: using calculate_bit_shift()
Bluetooth: Fix bogus check for re-auth no supported with non-ssp
Bluetooth: btmtkuart: fix recv_buf() return value
ip6_tunnel: fix NEXTHDR_FRAGMENT handling in ip6_tnl_parse_tlv_enc_lim()
RDMA/usnic: Silence uninitialized symbol smatch warnings
media: pvrusb2: fix use after free on context disconnection
drm/bridge: Fix typo in post_disable() description
f2fs: fix to avoid dirent corruption
drm/radeon/r600_cs: Fix possible int overflows in r600_cs_check_reg()
drm/radeon/r100: Fix integer overflow issues in r100_cs_track_check()
drm/radeon: check return value of radeon_ring_lock()
ASoC: cs35l33: Fix GPIO name and drop legacy include
ASoC: cs35l34: Fix GPIO name and drop legacy include
drm/msm/mdp4: flush vblank event on disable
drm/drv: propagate errors from drm_modeset_register_all()
drm/radeon: check the alloc_workqueue return value in radeon_crtc_init()
drm/radeon/dpm: fix a memleak in sumo_parse_power_table
drm/radeon/trinity_dpm: fix a memleak in trinity_parse_power_table
media: cx231xx: fix a memleak in cx231xx_init_isoc
media: dvbdev: drop refcount on error path in dvb_device_open()
drm/amdgpu/debugfs: fix error code when smc register accessors are NULL
drm/amd/pm: fix a double-free in si_dpm_init
drivers/amd/pm: fix a use-after-free in kv_parse_power_table
gpu/drm/radeon: fix two memleaks in radeon_vm_init
watchdog: set cdev owner before adding
watchdog/hpwdt: Only claim UNKNOWN NMI if from iLO
watchdog: bcm2835_wdt: Fix WDIOC_SETTIMEOUT handling
mmc: sdhci_omap: Fix TI SoC dependencies
of: Fix double free in of_parse_phandle_with_args_map
of: unittest: Fix of_count_phandle_with_args() expected value message
binder: fix async space check for 0-sized buffers
Input: atkbd - use ab83 as id when skipping the getid command
Revert "ASoC: atmel: Remove system clock tree configuration for at91sam9g20ek"
xen-netback: don't produce zero-size SKB frags
binder: fix race between mmput() and do_exit()
binder: fix unused alloc->free_async_space
tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
usb: phy: mxs: remove CONFIG_USB_OTG condition for mxs_phy_is_otg_host()
usb: dwc: ep0: Update request status in dwc3_ep0_stall_restart
Revert "usb: dwc3: Soft reset phy on probe for host"
Revert "usb: dwc3: don't reset device side if dwc3 was configured as host-only"
usb: chipidea: wait controller resume finished for wakeup irq
Revert "usb: typec: class: fix typec_altmode_put_partner to put plugs"
usb: typec: class: fix typec_altmode_put_partner to put plugs
usb: mon: Fix atomicity violation in mon_bin_vma_fault
ALSA: oxygen: Fix right channel of capture volume mixer
fbdev: flush deferred work in fb_deferred_io_fsync()
wifi: rtlwifi: Remove bogus and dangerous ASPM disable/enable code
wifi: rtlwifi: Convert LNKCTL change to PCIe cap RMW accessors
wifi: mwifiex: configure BSSID consistently when starting AP
HID: wacom: Correct behavior when processing some confidence == false touches
MIPS: Alchemy: Fix an out-of-bound access in db1200_dev_setup()
MIPS: Alchemy: Fix an out-of-bound access in db1550_dev_setup()
acpi: property: Let args be NULL in __acpi_node_get_property_reference
perf genelf: Set ELF program header addresses properly
apparmor: avoid crash when parsed profile name is empty
serial: imx: Correct clock error message in function probe()
net: qualcomm: rmnet: fix global oob in rmnet_policy
net: ravb: Fix dma_addr_t truncation in error case
net: dsa: vsc73xx: Add null pointer check to vsc73xx_gpio_probe
ipvs: avoid stat macros calls from preemptible context
kdb: Censor attempts to set PROMPT without ENABLE_MEM_READ
kdb: Fix a potential buffer overflow in kdb_local()
i2c: s3c24xx: fix read transfers in polling mode
i2c: s3c24xx: fix transferring more than one message in polling mode
Revert "NFSD: Fix possible sleep during nfsd4_release_lockowner()"
crypto: scompress - initialize per-CPU variables on each CPU
Linux 4.19.306
Change-Id: Ib746be8cff1e4086680c032a03b0fc0ab5968a51
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 4f41d30cd6dc865c3cbc1a852372321eba6d4e4c ]
When appending "[defcmd]" to 'kdb_prompt_str', the size of the string
already in the buffer should be taken into account.
An option could be to switch from strncat() to strlcat() which does the
correct test to avoid such an overflow.
However, this actually looks as dead code, because 'defcmd_in_progress'
can't be true here.
See a more detailed explanation at [1].
[1]: https://lore.kernel.org/all/CAD=FV=WSh7wKN7Yp-3wWiDgX4E3isQ8uh0LCzTmd1v9Cg9j+nQ@mail.gmail.com/
Fixes: 5d5314d679 ("kdb: core for kgdb back end (1 of 2)")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ad99b5105c0823ff02126497f4366e6a8009453e ]
Currently the PROMPT variable could be abused to provoke the printf()
machinery to read outside the current stack frame. Normally this
doesn't matter becaues md is already a much better tool for reading
from memory.
However the md command can be disabled by not setting KDB_ENABLE_MEM_READ.
Let's also prevent PROMPT from being modified in these circumstances.
Whilst adding a comment to help future code reviewers we also remove
the #ifdef where PROMPT in consumed. There is no problem passing an
unused (0) to snprintf when !CONFIG_SMP.
argument
Reported-by: Wang Xiayang <xywang.sjtu@sjtu.edu.cn>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Stable-dep-of: 4f41d30cd6dc ("kdb: Fix a potential buffer overflow in kdb_local()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 71fee48fb772ac4f6cfa63dbebc5629de8b4cc09 upstream.
When offlining and onlining CPUs the overall reported idle and iowait
times as reported by /proc/stat jump backward and forward:
cpu 132 0 176 225249 47 6 6 21 0 0
cpu0 80 0 115 112575 33 3 4 18 0 0
cpu1 52 0 60 112673 13 3 1 2 0 0
cpu 133 0 177 226681 47 6 6 21 0 0
cpu0 80 0 116 113387 33 3 4 18 0 0
cpu 133 0 178 114431 33 6 6 21 0 0 <---- jump backward
cpu0 80 0 116 114247 33 3 4 18 0 0
cpu1 52 0 61 183 0 3 1 2 0 0 <---- idle + iowait start with 0
cpu 133 0 178 228956 47 6 6 21 0 0 <---- jump forward
cpu0 81 0 117 114929 33 3 4 18 0 0
Reason for this is that get_idle_time() in fs/proc/stat.c has different
sources for both values depending on if a CPU is online or offline:
- if a CPU is online the values may be taken from its per cpu
tick_cpu_sched structure
- if a CPU is offline the values are taken from its per cpu cpustat
structure
The problem is that the per cpu tick_cpu_sched structure is set to zero on
CPU offline. See tick_cancel_sched_timer() in kernel/time/tick-sched.c.
Therefore when a CPU is brought offline and online afterwards both its idle
and iowait sleeptime will be zero, causing a jump backward in total system
idle and iowait sleeptime. In a similar way if a CPU is then brought
offline again the total idle and iowait sleeptimes will jump forward.
It looks like this behavior was introduced with commit 4b0c0f294f
("tick: Cleanup NOHZ per cpu data on cpu down").
This was only noticed now on s390, since we switched to generic idle time
reporting with commit be76ea614460 ("s390/idle: remove arch_cpu_idle_time()
and corresponding code").
Fix this by preserving the values of idle_sleeptime and iowait_sleeptime
members of the per-cpu tick_sched structure on CPU hotplug.
Fixes: 4b0c0f294f ("tick: Cleanup NOHZ per cpu data on cpu down")
Reported-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240115163555.1004144-1-hca@linux.ibm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b07bc2347672cc8c7293c64499f1488278c5ca3d ]
Reproduced with below sequence:
dma_declare_coherent_memory()->dma_release_coherent_memory()
->dma_declare_coherent_memory()->"return -EBUSY" error
It will return -EBUSY from the dma_assign_coherent_memory()
in dma_declare_coherent_memory(), the reason is that dev->dma_mem
pointer has not been set to NULL after it's freed.
Fixes: cf65a0f6f6 ("dma-mapping: move all DMA mapping code to kernel/dma")
Signed-off-by: Joakim Zhang <joakim.zhang@cixtech.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9b75dbeb36fcd9fc7ed51d370310d0518a387769 ]
When looking up an element in LPM trie, the condition 'matchlen ==
trie->max_prefixlen' will never return true, if key->prefixlen is larger
than trie->max_prefixlen. Consequently all elements in the LPM trie will
be visited and no element is returned in the end.
To resolve this, check key->prefixlen first before walking the LPM trie.
Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231105085801.3742-1-dev@der-flo.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 60be76eeabb3d83858cc6577fc65c7d0f36ffd42 ]
If for some reason the trace_marker write does not have a nul byte for the
string, it will overflow the print:
trace_seq_printf(s, ": %s", field->buf);
The field->buf could be missing the nul byte. To prevent overflow, add the
max size that the buf can be by using the event size and the field
location.
int max = iter->ent_size - offsetof(struct print_entry, buf);
trace_seq_printf(s, ": %*.s", max, field->buf);
Link: https://lore.kernel.org/linux-trace-kernel/20231212084444.4619b8ce@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Using msleep() is problematic because it's compared against
ratelimiter.c's ktime_get_coarse_boottime_ns(), which means on systems
with slow jiffies (such as UML's forced HZ=100), the result is
inaccurate. So switch to using schedule_hrtimeout().
However, hrtimer gives us access only to the traditional posix timers,
and none of the _COARSE variants. So now, rather than being too
imprecise like jiffies, it's too precise.
One solution would be to give it a large "range" value, but this will
still fire early on a loaded system. A better solution is to align the
timeout to the actual coarse timer, and then round up to the nearest
tick, plus change.
So add the timeout to the current coarse time, and then
schedule_hrtimer() until the absolute computed time.
This should hopefully reduce flakes in CI as well. Note that we keep the
retry loop in case the entire function is running behind, because the
test could still be scheduled out, by either the kernel or by the
hypervisor's kernel, in which case restarting the test and hoping to not
be scheduled out still helps.
Bug: 254441685
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(cherry picked from commit 151c8e499f4705010780189377f85b57400ccbf5)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: Iee1787011a726c202b607cd0e44f196f4d6af642
Since commit 23127296889f ("sched/fair: Update scale invariance of PELT")
change to use rq_clock_pelt() instead of rq_clock_task(), we should also
use rq_clock_pelt() for throttled_clock_task_time and throttled_clock_task
accounting to get correct cfs_rq_clock_pelt() of throttled cfs_rq. And
rename throttled_clock_task(_time) to be clock_pelt rather than clock_task.
Bug: 254441685
Fixes: 23127296889f ("sched/fair: Update scale invariance of PELT")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220408115309.81603-1-zhouchengming@bytedance.com
(cherry picked from commit 64eaf50731ac0a8c76ce2fedd50ef6652aabc5ff)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I61e971d09f14708b8ee170fd5d5109144bba6e34