Commit Graph

11528 Commits

Author SHA1 Message Date
Brooks Davis
a6fffd6cb0 The devices that supported EVFILT_NETDEV kqueue filters were removed in
r195175.  Remove all definitions, documentation, and usage.

fifo_misc.c:
	Remove all kqueue tests as fifo_io.c performs all those that
	would have remained.

Reviewed by:	rwatson
MFC after:	3 weeks
X-MFC note:	don't change vlan_link_state() function signature
2009-12-31 20:29:58 +00:00
Konstantin Belousov
17c4c3563c Allow swap out of the kernel stack for the thread with priority greater
or equial then PSOCK, not less or equial. Higher priority has lesser
numerical value.

Existing test does not allow for swapout of the thread waiting for
advisory lock, for exiting child or sleeping for timeout. On the other
hand, high-priority waiters of VFS/VM events can be swapped out.

Tested by:	pho
Reviewed by:	jhb
MFC after:	1 week
2009-12-31 18:52:58 +00:00
John Baldwin
6a41dc1043 Actually set RLE_ALLOCATED when allocating a reserved resource so that
resource_list_release() will later release the resource instead of failing.
2009-12-30 22:37:28 +00:00
John Baldwin
0eb9893a80 - Assert that a reserved resource returned via resource_list_alloc() is not
active.
- Fix bus_generic_rl_(alloc|release)_resource() to not attempt to fetch a
  resource list for grandchild devices, but just pass those requests up to
  the parent directly.  This worked by accident previously, but it is
  better to not let bus drivers try to operate on devices they do not
  manage.
2009-12-30 19:44:31 +00:00
Robert Noland
cfd7bacef2 Update d_mmap() to accept vm_ooffset_t and vm_memattr_t.
This replaces d_mmap() with the d_mmap2() implementation and also
changes the type of offset to vm_ooffset_t.

Purge d_mmap2().

All driver modules will need to be rebuilt since D_VERSION is also
bumped.

Reviewed by:	jhb@
MFC after:	Not in this lifetime...
2009-12-29 21:51:28 +00:00
Edward Tomasz Napierala
6cb02977e2 SLIP is gone; remove its mutex from witness. 2009-12-29 08:45:27 +00:00
Ed Schouten
62375ca8c1 Don't forget to use `void' for sched_balance(). It has no arguments. 2009-12-28 23:12:12 +00:00
Antoine Brodin
13e403fdea (S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.

PR:		137213
Submitted by:	Eygene Ryabinkin (initial version)
MFC after:	1 month
2009-12-28 22:56:30 +00:00
Konstantin Belousov
a411786576 Add a knob to allow reclaim of the directory vnodes that are source of
the namecache records. The reclamation is not enabled by default because
for typical workload it would make namecache unusable, but large nested
directory tree easily puts any process that accesses filesystem into 1
second wait for vlru.

Reported by:	yar (long time ago)
MFC after:	3 days
2009-12-28 15:35:39 +00:00
Edward Tomasz Napierala
558e9b5c95 Now that all the callers seem to be fixed, add KASSERTs to make sure VAPPEND
is not being used improperly.
2009-12-26 11:36:10 +00:00
Bjoern A. Zeeb
e4ff598ec6 Remove extra spaces (no functional change).
MFC after:	3 days
2009-12-25 21:14:05 +00:00
Bjoern A. Zeeb
095809b084 Remove an unused global.
MFC after:	3 days
2009-12-25 20:03:03 +00:00
Robert Watson
c7ca33d138 Minor comment tweaks in rmlocks.
MFC after:	3 days
2009-12-25 01:16:24 +00:00
Konstantin Belousov
49e3050e6c VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundand information, need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by:	alc
Tested by:	pho
MFC after:	3 weeks
2009-12-21 12:29:38 +00:00
Ed Schouten
907b48bc05 Fix indentation. 2009-12-20 22:55:27 +00:00
Ed Schouten
8dc9b4cf04 Let access overriding to TTYs depend on the cdev_priv, not the vnode.
Basically this commit changes two things, which improves access to TTYs
in exceptional conditions. Basically the problem was that when you ran
jexec(8) to attach to a jail, you couldn't use /dev/tty (well, also the
node of the actual TTY, e.g. /dev/pts/X). This is very inconvenient if
you want to attach to screens quickly, use ssh(1), etc.

The fixes:

- Cache the cdev_priv of the controlling TTY in struct session. Change
  devfs_access() to compare against the cdev_priv instead of the vnode.
  This allows you to bypass UNIX permissions, even across different
  mounts of devfs.

- Extend devfs_prison_check() to unconditionally expose the device node
  of the controlling TTY, even if normal prison nesting rules normally
  don't allow this. This actually allows you to interact with this
  device node.

To be honest, I'm not really happy with this solution. We now have to
store three pointers to a controlling TTY (s_ttyp, s_ttyvp, s_ttydp).
In an ideal world, we should just get rid of the latter two and only use
s_ttyp, but this makes certian pieces of code very impractical (e.g.
devfs, kern_exit.c).

Reported by:	Many people
2009-12-19 18:42:12 +00:00
Edward Tomasz Napierala
28d3fd007e Interpret VAPPEND correctly in vaccess_acl_nfs4(9). 2009-12-19 11:41:52 +00:00
Ed Schouten
e6d84d057a Make the wchan names of pts(4) fit in top(1).
Just like a similar change we made to the TTY code about half a year
ago, make these strings look similar.

Suggested by:	Jille Timmermans <jille@quis.cx>
2009-12-18 20:11:29 +00:00
Andrew Thompson
2b54315009 If the runcount is non-zero in eventhandler_deregister() then one or more
threads are executing the eventhandler, sleep in this case to make it safe for
module unload. If the runcount was up then an entry would have been marked
EHE_DEAD_PRIORITY so use this as a trigger to do the wakeup in
eventhandler_prune_list().

Reviewed by:	jhb
2009-12-17 21:17:13 +00:00
Matt Jacob
e7d829a46c Fix argument order in a call to mtx_init.
MFC after:	1 week
2009-12-17 00:22:56 +00:00
Luigi Rizzo
20c510f826 Properly fix callout handling by putting all the per-cpu info in
struct callout_cpu. From the comment in the file:

+ * There is one struct callout_cpu per cpu, holding all relevant
+ * state for the callout processing thread on the individual CPU.
+ * In particular:
+ *     cc_ticks is incremented once per tick in callout_cpu().
+ *     It tracks the global 'ticks' but in a way that the individual
+ *     threads should not worry about races in the order in which
+ *     hardclock() and hardclock_cpu() run on the various CPUs.
+ *     cc_softclock is advanced in callout_cpu() to point to the
+ *     first entry in cc_callwheel that may need handling. In turn,
+ *     a softclock() is scheduled so it can serve the various entries i
+ *     such that cc_softclock <= i <= cc_ticks .

Together with a smaller patch committed in september, this fixes a
bug that affects 8.0 with apps that rely on callouts to fire exactly
in the number of ticks specified (qemu among them).
Right now, callouts in 8.0 fire one tick late.

This was discussed in september with JeffR and jhb

MFC after:	3 days
2009-12-14 12:23:46 +00:00
Bjoern A. Zeeb
de0bd6f76b Throughout the network stack we have a few places of
if (jailed(cred))
left.  If you are running with a vnet (virtual network stack) those will
return true and defer you to classic IP-jails handling and thus things
will be "denied" or returned with an error.

Work around this problem by introducing another "jailed()" function,
jailed_without_vnet(), that also takes vnets into account, and permits
the calls, should the jail from the given cred have its own virtual
network stack.

We cannot change the classic jailed() call to do that,  as it is used
outside the network stack as well.

Discussed with:	julian, zec, jamie, rwatson (back in Sept)
MFC after:	5 days
2009-12-13 13:57:32 +00:00
Attilio Rao
2028867def In current code, threads performing an interruptible sleep (on both
sxlock, via the sx_{s, x}lock_sig() interface, or plain lockmgr), will
leave the waiters flag on forcing the owner to do a wakeup even when if
the waiter queue is empty.
That operation may lead to a deadlock in the case of doing a fake wakeup
on the "preferred" (based on the wakeup algorithm) queue while the other
queue has real waiters on it, because nobody is going to wakeup the 2nd
queue waiters and they will sleep indefinitively.

A similar bug, is present, for lockmgr in the case the waiters are
sleeping with LK_SLEEPFAIL on.  In this case, even if the waiters queue
is not empty, the waiters won't progress after being awake but they will
just fail, still not taking care of the 2nd queue waiters (as instead the
lock owned doing the wakeup would expect).

In order to fix this bug in a cheap way (without adding too much locking
and complicating too much the semantic) add a sleepqueue interface which
does report the actual number of waiters on a specified queue of a
waitchannel (sleepq_sleepcnt()) and use it in order to determine if the
exclusive waiters (or shared waiters) are actually present on the lockmgr
(or sx) before to give them precedence in the wakeup algorithm.
This fix alone, however doesn't solve the LK_SLEEPFAIL bug. In order to
cope with it, add the tracking of how many exclusive LK_SLEEPFAIL waiters
a lockmgr has and if all the waiters on the exclusive waiters queue are
LK_SLEEPFAIL just wake both queues.

The sleepq_sleepcnt() introduction and ABI breakage require
__FreeBSD_version bumping.

Reported by:	avg, kib, pho
Reviewed by:	kib
Tested by:	pho
2009-12-12 21:31:07 +00:00
John Baldwin
42a346fa63 For some buses, devices may have active resources assigned even though they
are not allocated by the device driver.  These resources should still appear
allocated from the system's perspective so that their assigned ranges are
not reused by other resource requests.  The PCI bus driver has used a hack
to effect this for a while now where it uses rman_set_device() to assign
devices to the PCI bus when they are first encountered and later assigns
them to the actual device when a driver allocates a BAR.  A few downsides of
this approach is that it results in somewhat confusing devinfo -r output as
well as not being very easily portable to other bus drivers.

This commit adds generic support for "reserved" resources to the resource
list API used by many bus drivers to manage the resources of child devices.
A resource may be reserved via resource_list_reserve().  This will allocate
the resource from the bus' parent without activating it.
resource_list_alloc() recognizes an attempt to allocate a reserved resource.
When this happens it activates the resource (if requested) and then returns
the reserved resource.  Similarly, when a reserved resource is released via
resource_list_release(), it is deactivated (if it is active) and the
resource is then marked reserved again, but is left allocated from the
bus' parent.  To completely remove a reserved resource, a bus driver may
use resource_list_unreserve().  A bus driver may use resource_list_busy()
to determine if a reserved resource is allocated by a child device or if
it can be unreserved.

The PCI bus driver has been changed to use this framework instead of
abusing rman_set_device() to keep track of reserved vs allocated resources.

Submitted by:	imp (an older version many moons ago)
MFC after:	1 month
2009-12-09 21:52:53 +00:00
Edward Tomasz Napierala
9d7031a6d6 Don't add VAPPEND if the file is not being opened for writing. Note that this
only affects cases where open(2) is being used improperly - i.e. when the user
specifies O_APPEND without O_WRONLY or O_RDWR.

Reviewed by:	rwatson
2009-12-08 20:47:10 +00:00
Konstantin Belousov
4f17d481ed Remove wrong assertion. Debugee is allowed to lose a signal.
Reported and tested by:	jh
MFC after:	2 weeks
2009-12-03 20:16:59 +00:00
Edward Tomasz Napierala
6bb58cdd0f Add change that was somehow missed in r192586. It could manifest by
incorrectly returning EINVAL from acl_valid(3) for applications linked
against pre-8.0 libc.
2009-12-03 13:29:24 +00:00
Ed Schouten
6eaf04022b Don't allocate an input buffer for a TTY when the receiver is turned off.
When the termios CREAD flag is not set, it makes little sense to
allocate an input buffer. Just set the size to 0 in this case to reduce
memory footprint.

Disallow CREAD to be disabled for pseudo-devices to prevent
foot-shooting.
2009-12-01 19:14:57 +00:00
Alan Cox
a6d42a0d62 Replace VM_PROT_OVERRIDE_WRITE by VM_PROT_COPY. VM_PROT_OVERRIDE_WRITE has
represented a write access that is allowed to override write protection.
Until now, VM_PROT_OVERRIDE_WRITE has been used to write breakpoints into
text pages.  Text pages are not just write protected but they are also
copy-on-write.  VM_PROT_OVERRIDE_WRITE overrides the write protection on the
text page and triggers the replication of the page so that the breakpoint
will be written to a private copy.  However, here is where things become
confused.  It is the debugger, not the process being debugged that requires
write access to the copied page.  Nonetheless, the copied page is being
mapped into the process with write access enabled.  In other words, once the
debugger sets a breakpoint within a text page, the program can write to its
private copy of that text page.  Whereas prior to setting the breakpoint, a
SIGSEGV would have occurred upon a write access.  VM_PROT_COPY addresses
this problem.  The combination of VM_PROT_READ and VM_PROT_COPY forces the
replication of a copy-on-write page even though the access is only for read.
Moreover, the replicated page is only mapped into the process with read
access, and not write access.

Reviewed by:	kib
MFC after:	4 weeks
2009-11-26 05:16:07 +00:00
Ivan Voras
cbc4ea28e2 Make ULE process usage (%CPU) accounting usable again by keeping track
of the last tick we incremented on.

Submitted by:	matthew.fleming/at/isilon.com, is/at/rambler-co.ru
Reviewed by:	jeff (who thinks there should be a better way in the future)
Approved by:	gnn (mentor)
MFC after:	3 weeks
2009-11-24 19:57:41 +00:00
Konstantin Belousov
080136212f On the return path from F_RDAHEAD and F_READAHEAD fcntls, do not
unlock Giant twice.

While there, bring conditions in the do/while loops closer to style,
that also makes the lines fit into 80 columns.

Reported and tested by:	dougb
2009-11-20 22:22:53 +00:00
Jaakko Heinonen
10d843a446 Extend ddb(4) "show mount" command to print active string mount options.
Note that only option names are printed, not values.

Reviewed by:	pjd
Approved by:	trasz (mentor)
MFC after:	2 weeks
2009-11-19 14:33:03 +00:00
Oleksandr Tymoshenko
3ea6157e6b - Unbreak build with KLD_DEBUG defined
- Add debug.kld_debug sysctl to control KLD debugging level
- Print information about KLD dependencies with debug enabled
2009-11-17 21:56:12 +00:00
Konstantin Belousov
a3de221dbe Among signal generation syscalls, only sigqueue(2) is allowed by POSIX
to fail due to lack of resources to queue siginfo. Add KSI_SIGQ flag
that allows sigqueue_add() to fail while trying to allocate memory for
new siginfo. When the flag is not set, behaviour is the same as for
KSI_TRAP: if memory cannot be allocated, set bit in sq_kill. KSI_TRAP is
kept to preserve KBI.

Add SI_KERNEL si_code, to be used in siginfo.si_code when signal is
generated by kernel. Deliver siginfo when signal is generated by kill(2)
family of syscalls (SI_USER with properly filled si_uid and si_pid), or
by kernel (SI_KERNEL, mostly job control or SIGIO). Since KSI_SIGQ flag
is not set for the ksi, low memory condition cause old behaviour.

Keep psignal(9) KBI intact, but modify it to generate SI_KERNEL
si_code. Pgsignal(9) and gsignal(9) now take ksi explicitely. Add
pksignal(9) that behaves like psignal but takes ksi, and ddb kill
command implemented as pksignal(..., ksi = NULL) to not do allocation
while in debugger.

While there, remove some register specifiers and use ANSI C prototypes.

Reviewed by:	davidxu
MFC after:	1 month
2009-11-17 11:39:15 +00:00
Xin LI
1a9d4dda9b Revert revision 199201 for now as it has introduced a kernel vulnerability
and requires more polishing.
2009-11-12 19:02:10 +00:00
Attilio Rao
d113304956 Add the possibility for vfs.root.mountfrom tunable to accept a list of
items rather than a single one. The list is a space separated collection
of items defined as the current one accepted.

While there fix also a nit in a comment.

Obtained from:	Sandvine Incorporated
Reviewed by:	emaste
Tested by:	Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
Sponsored by:	Sandvine Incorporated
MFC:		2 weeks
2009-11-12 15:59:05 +00:00
Attilio Rao
023c800576 The building the dev nameunit string, in devclass_add_device() is based
on the assumption that the unit linked with the device is invariant but
that can change when calling devclass_alloc_unit() (because -1 is passed
or, more simply, because the unit choosen is beyond the table limits).
This results in a completely bogus string building.

Fix this by reserving the necessary room for all the possible characters
printable by a positive integer (we do not allow for negative unit
number).

Reported by:	Sandvine Incorporated
Reviewed by:	emaste
Sponsored by:	Sandvine Incorporated
MFC:		1 week
2009-11-12 00:52:14 +00:00
Xin LI
41c8c6e876 Add interface description capability as inspired by OpenBSD.
MFC after:	3 months
2009-11-11 21:30:58 +00:00
Edward Tomasz Napierala
23e62e7654 Revert r198873. Having different VAPPEND semantics for VOP_ACCESS(9)
and VOP_ACCESSX(9) is not a good idea.
2009-11-11 13:49:22 +00:00
Konstantin Belousov
88e6f61a19 When rename("a", "b/.") is performed, target namei() call returns
dvp == vp. Rename syscall does not check for the case, and at least
ufs_rename() cannot deal with it. POSIX explicitely requires that both
rename(2) and rmdir(2) return EINVAL when any of the pathes end in "/.".

Detect the slashdot lookup for RENAME or REMOVE in lookup(), and return
EINVAL.

Reported by:	Jim Meyering <jim meyering net>
Tested by:	simon, pho
MFC after:	1 week
2009-11-10 11:50:37 +00:00
Konstantin Belousov
75c586a4c8 In r198506, kern_sigsuspend() started doing cursig/postsig loop to make
sure that a signal was delivered to the thread before returning from
syscall. Signal delivery puts new return frame on the user stack, and
modifies trap frame to enter signal handler. As a consequence, syscall
return code sets EINTR as error return for signal frame, instead of the
syscall return.

Also, for ia64, due to different registers layout for those two kind of
frames, usermode sigsegfaulted when returned from signal handler.

Use newly-introduced cpu_set_syscall_retval(9) to set syscall result,
and return EJUSTRETURN from kern_sigsuspend() to prevent syscall return
code from modifying this frame [1].

Another issue is that pending SIGCONT might be cancelled by SIGSTOP,
causing postsig() not to deliver any catched signal [2]. Modify
postsig() to return 1 if signal was posted, and 0 otherwise, and use
this in the kern_sigsuspend loop.

Proposed by:	marcel [1]
Noted by:	davidxu [2]
Reviewed by:	marcel, davidxu
MFC after:	1 month
2009-11-10 11:46:53 +00:00
Edward Tomasz Napierala
7ba889b8f4 Add suggestion for zfs root. 2009-11-08 09:54:25 +00:00
Attilio Rao
337c5ff4fc Save the sack when doing a lockmgr_disown() call.
Requested by:	kib
MFC:		3 days
2009-11-06 22:33:03 +00:00
Edward Tomasz Napierala
7a172809b5 Fix build.
Submitted by:	Andrius Morkūnas <hinokind at gmail.com>
2009-11-04 08:25:58 +00:00
Edward Tomasz Napierala
3b1b09809f Revert r198874, pending further discussion. 2009-11-04 07:14:16 +00:00
Edward Tomasz Napierala
da9ce28ecb Style fixes. 2009-11-04 07:04:15 +00:00
Edward Tomasz Napierala
8fafa5cecf Make sure we don't end up with VAPPEND without VWRITE, if someone calls open(2)
like this: open(..., O_APPEND).
2009-11-04 06:48:34 +00:00
Edward Tomasz Napierala
597954c813 While VAPPEND without VWRITE makes sense for VOP_ACCESSX(9) (e.g. to check
for the permission to create subdirectory (ACE4_ADD_SUBDIRECTORY)), it doesn't
really make sense for VOP_ACCESS(9).  Also, many VOP_ACCESS(9) implementations
don't expect that.  Make sure we don't confuse them.
2009-11-04 06:47:14 +00:00
Ed Schouten
ca1d2f657a Make /dev/klog and kern.msgbuf* MPSAFE.
Normally msgbufp is locked using Giant. Switch it to use the
msgbuf_lock. Instead of changing the tsleep() calls to msleep(), just
convert it to condvar(9).

In my opinion the locking around msgbuf_peekbytes() still remains
questionable. It looks like locks are dropped while performing copies of
multiple blocks to userspace, which may cause the msgbuf to be reset in
the mean time. At least getting it underneath from Giant should make it
a little easier for us to figure out how to solve that.

Reminded by:	rdivacky
2009-11-03 21:06:19 +00:00
Attilio Rao
1b9d701fee Split P_NOLOAD into a per-thread flag (TDF_NOLOAD).
This improvements aims for avoiding further cache-misses in scheduler
specific functions which need to keep track of average thread running
time and further locking in places setting for this flag.

Reported by:	jeff (originally), kris (currently)
Reviewed by:	jhb
Tested by:	Giuseppe Cocomazzi <sbudella at email dot it>
2009-11-03 16:46:52 +00:00
Konstantin Belousov
1c89fc757a If socket buffer space appears to be lower then sum of count of already
prepared bytes and next portion of transfer, inner loop of kern_sendfile()
aborts, not preparing next mbuf for socket buffer, and not modifying
any outer loop invariants. The thread loops in the outer loop forever.

Instead of breaking from inner loop, prepare only bytes that fit into
the socket buffer space.

In collaboration with:	pho
Reviewed by:	bz
PR:	kern/138999
MFC after:	2 weeks
2009-11-03 12:52:35 +00:00
Konstantin Belousov
80a8b0f3bf Trapsignal() and postsig() call kern_sigprocmask() with both process
lock and curproc->p_sigacts->ps_mtx. Reschedule_signals may need to have
ps_mtx locked to decide and wakeup a thread, causing recursion on the
mutex.

Inform kern_sigprocmask() and reschedule_signals() about lock state
of the ps_mtx by new flag SIGPROCMASK_PS_LOCKED to avoid recursion.

Reported and tested by:	keramida
MFC after:	1 month
2009-10-30 10:10:39 +00:00
Konstantin Belousov
8084540253 Trapsignal() calls kern_sigprocmask() when delivering catched signal
with proc lock held.

Reported and tested by:	Mykola Dzham  freebsd at levsha org ua
MFC after:	1 month
2009-10-29 14:34:24 +00:00
Konstantin Belousov
7415a41f4a Fix style issue. 2009-10-29 10:03:08 +00:00
Konstantin Belousov
17c974499c Regenerate 2009-10-27 11:01:15 +00:00
Konstantin Belousov
066d836b02 Current pselect(3) is implemented in usermode and thus vulnerable to
well-known race condition, which elimination was the reason for the
function appearance in first place. If sigmask supplied as argument to
pselect() enables a signal, the signal might be delivered before thread
called select(2), causing lost wakeup. Reimplement pselect() in kernel,
making change of sigmask and sleep atomic.

Since signal shall be delivered to the usermode, but sigmask restored,
set TDP_OLDMASK and save old mask in td_oldsigmask. The TDP_OLDMASK
should be cleared by ast() in case signal was not gelivered during
syscall execution.

Reviewed by:	davidxu
Tested by:	pho
MFC after:	1 month
2009-10-27 10:55:34 +00:00
Konstantin Belousov
d6e029adbe In r197963, a race with thread being selected for signal delivery
while in kernel mode, and later changing signal mask to block the
signal, was fixed for sigprocmask(2) and ptread_exit(3). The same race
exists for sigreturn(2), setcontext(2) and swapcontext(2) syscalls.

Use kern_sigprocmask() instead of direct manipulation of td_sigmask to
reschedule newly blocked signals, closing the race.

Reviewed by:	davidxu
Tested by:	pho
MFC after:	1 month
2009-10-27 10:47:58 +00:00
Konstantin Belousov
84440afb54 In kern_sigsuspend(), better manipulate thread signal mask using
kern_sigprocmask() to properly notify other possible candidate threads
for signal delivery.

Since sigsuspend() shall only return to usermode after a signal was
delivered, do cursig/postsig loop immediately after waiting for
signal, repeating the wait if wakeup was spurious due to race with
other thread fetching signal from the process queue before us. Add
thread_suspend_check() call to allow the thread to be stopped or killed
while in loop.

Modify last argument of kern_sigprocmask() from boolean to flags,
allowing the function to be called with locked proc. Convertion of the
callers that supplied 1 to the old argument will be done in the next
commit, and due to SIGPROCMASK_OLD value equial to 1, code is formally
correct in between.

Reviewed by:	davidxu
Tested by:	pho
MFC after:	1 month
2009-10-27 10:42:24 +00:00
John Baldwin
86855bf549 Another nit that both I and ispell missed.
Submitted by:	Ben Kaduk  minimarmot of gmail
2009-10-26 18:32:06 +00:00
John Baldwin
9390262576 Fix some spelling nits. 2009-10-26 17:42:03 +00:00
Joseph Koshy
16d95d4f92 Inform hwpmc(4) of a thread's impending demise prior to invoking sched_throw().
Debugging help:		fabient
Review and testing by:	fabient
2009-10-25 04:34:47 +00:00
Alan Cox
a0c703bf21 Update a comment to reflect the previous change. 2009-10-25 02:48:29 +00:00
Ruslan Ermilov
4d9d1e823c - Rename tunable kern.ipc.shmmaxpgs to kern.ipc.shmall.
- Explain the fuss when initializing shmmax.

PR:	75542 (mistakenly closed instead of PR 75541)
2009-10-24 19:00:58 +00:00
John Baldwin
5ca4819ddf - Fix several off-by-one errors when using MAXCOMLEN. The p_comm[] and
td_name[] arrays are actually MAXCOMLEN + 1 in size and a few places that
  created shadow copies of these arrays were just using MAXCOMLEN.
- Prefer using sizeof() of an array type to explicit constants for the
  array length in a few places.
- Ensure that all of p_comm[] and td_name[] is always zero'd during
  execve() to guard against any possible information leaks.  Previously
  trailing garbage in p_comm[] could be leaked to userland in ktrace
  record headers via td_name[].

Reviewed by:	bde
2009-10-23 15:14:54 +00:00
John Baldwin
4f9d48e478 Don't bother copying the name of a kproc or kthread out into a temporary
array just to pass that array to printf().  kproc and kthread names are
NUL-terminated and can be printed using printf() directly.

Reviewed by:	bde
2009-10-23 15:09:51 +00:00
John Baldwin
3eec6f034a Set the devclass_t pointer specified in the DRIVER_MODULE() macro
sooner so it is always valid when a driver's identify routine is
called.  Previously, new-bus would attempt to create the devclass for
a newly loaded driver in two separate places, once in
devclass_add_driver(), and again after devclass_add_driver() returned
in driver_module_handler().  Only the second lookup attempted to set a
device class' parent and set the devclass_t pointer specified in the
DRIVER_MODULE() macro.  However, by the time it was executed, the
driver was already added to existing instances of the parent driver at
which point in time the new driver's identify routine would have been
invoked.  The fix is to merge the two attempts and only create the
devclass once in devclass_add_driver() including setting the
devclass_t pointer passed to DRIVER_MODULE() before the driver is
added to any existing bus devices.

Reported by:	avg
Reviewed by:	imp
MFC after:	2 weeks
2009-10-22 14:53:44 +00:00
Marcel Moolenaar
1a4fcaebe3 o Introduce vm_sync_icache() for making the I-cache coherent with
the memory or D-cache, depending on the semantics of the platform.
    vm_sync_icache() is basically a wrapper around pmap_sync_icache(),
    that translates the vm_map_t argumument to pmap_t.
o   Introduce pmap_sync_icache() to all PMAP implementation. For powerpc
    it replaces the pmap_page_executable() function, added to solve
    the I-cache problem in uiomove_fromphys().
o   In proc_rwmem() call vm_sync_icache() when writing to a page that
    has execute permissions. This assures that when breakpoints are
    written, the I-cache will be coherent and the process will actually
    hit the breakpoint.
o   This also fixes the Book-E PMAP implementation that was missing
    necessary locking while trying to deal with the I-cache coherency
    in pmap_enter() (read: mmu_booke_enter_locked).

The key property of this change is that the I-cache is made coherent
*after* writes have been done. Doing it in the PMAP layer when adding
or changing a mapping means that the I-cache is made coherent *before*
any writes happen. The difference is key when the I-cache prefetches.
2009-10-21 18:38:02 +00:00
Ruslan Ermilov
e64585bdc2 Random number generator initialization cleanup:
- Introduce new SI_SUB_RANDOM point in boot sequence to make it
clear from where one may start using random(9).  It should be as
early as possible, so place it just after SI_SUB_CPU where we
have some randomness on most platforms via get_cyclecount().

- Move stack protector initialization to be after SI_SUB_RANDOM
as before this point we have no randomness at all.  This fixes
stack protector to actually protect stack with some random guard
value instead of a well-known one.

Note that this patch doesn't try to address arc4random(9) issues.
With current code, it will be implicitly seeded by stack protector
and hence will get the same entropy as random(9).  It will be
securely reseeded once /dev/random is feeded by some entropy from
userland.

Submitted by:	Maxim Dounin <mdounin@mdounin.ru>
MFC after:	3 days
2009-10-20 16:36:51 +00:00
Ed Schouten
6015f6f35a Properly set the low watermarks when reducing the baud rate.
Now that buffers are deallocated lazily, we should not use
tty*q_getsize() to obtain the buffer size to calculate the low
watermarks. Doing this may cause the watermark to be placed outside the
typical buffer size.

This caused some regressions after my previous commit to the TTY code,
which allows pseudo-devices to resize the buffers as well.

Reported by:	yongari, dougb
MFC after:	1 week
2009-10-19 07:17:37 +00:00
Ed Schouten
5ed8d12443 Allow the buffer size to be configured for pseudo-like TTY devices.
Devices that don't implement param() (which means they don't support
hardware parameters such as flow control, baud rate) hardcode the baud
rate to TTYDEF_SPEED. This means the buffer size cannot be configured,
which is a little inconvenient when using canonical mode with big lines
of input, etc.

Make it adjustable, but do clamp it between B50 and B115200 to prevent
awkward buffer sizes. Remove the baud rate assignment from
/etc/gettytab. Trust the kernel to fill in a proper value.

Reported by:	Mikolaj Golub <to my trociny gmail com>
MFC after:	1 month
2009-10-18 19:48:53 +00:00
Ed Schouten
99087885be Make lock devices work properly.
It turned out I did add the code to use the init state devices to set
the termios structure when opening the device, but it seems I totally
forgot to add the bits required to force the actual locking of flags
through the lock state devices.

Reported by:	ru
MFC after:	1 week (to be discussed)
2009-10-18 19:45:44 +00:00
Konstantin Belousov
7564c4ad9a If ET_DYN binary has non-zero base address for some reason, honour it
and do not relocate the binary to ET_DYN_LOAD_ADDR. This allows for the
binary author to influence address map of the process. In particular,
when the binary is actually an interpeter, this allows to have almost
usual process address map.

Communicate the relocation bias of the mapping for interpeter-less
ET_DYN binary, that is interperter itself, in AT_BASE aux entry. This
way, rtld is able to find its dynamic structure and relocate itself.
Note that mapbase in the rtld is still wrong and requires further
fixing.

Reported and tested by:	rwatson
Discussed with:	kan
MFC after:	3 days
2009-10-18 12:57:48 +00:00
Ed Schouten
39410373b3 Print backspaces after echoing an EOF.
Applications like shells expect EOF to give no graphical output, while
our implementation prints ^D by default (tunable with stty echoctl).
Make the new implementation behave like the old TTY code. Print two
backspaces afterwards.

Reported by:	koitsu
MFC after:	1 month
2009-10-17 08:59:41 +00:00
John Baldwin
d0c9a29169 Use language more closely resembling English in a panic message.
Pointy hat to:	jhb
Submitted by:	pluknet
2009-10-15 18:51:19 +00:00
John Baldwin
37b8ef16cd Add a facility for associating optional descriptions with active interrupt
handlers.  This is primarily intended as a way to allow devices that use
multiple interrupts (e.g. MSI) to meaningfully distinguish the various
interrupt handlers.
- Add a new BUS_DESCRIBE_INTR() method to the bus interface to associate
  a description with an active interrupt handler setup by BUS_SETUP_INTR.
  It has a default method (bus_generic_describe_intr()) which simply passes
  the request up to the parent device.
- Add a bus_describe_intr() wrapper around BUS_DESCRIBE_INTR() that supports
  printf(9) style formatting using var args.
- Reserve MAXCOMLEN bytes in the intr_handler structure to hold the name of
  an interrupt handler and copy the name passed to intr_event_add_handler()
  into that buffer instead of just saving the pointer to the name.
- Add a new intr_event_describe_handler() which appends a description string
  to an interrupt handler's name.
- Implement support for interrupt descriptions on amd64 and i386 by having
  the nexus(4) driver supply a custom bus_describe_intr method that invokes
  a new intr_describe() MD routine which in turn looks up the associated
  interrupt event and invokes intr_event_describe_handler().

Requested by:	many
Reviewed by:	scottl
MFC after:	2 weeks
2009-10-15 14:54:35 +00:00
John Baldwin
a0f1535205 Fix a sign bug in the handling of nice priorities when computing the
interactive score for a thread.

Submitted by:	Taku YAMAMOTO  taku of tackymt.homeip.net
Reviewed by:	jeff
MFC after:	3 days
2009-10-15 11:41:12 +00:00
Joseph Koshy
8b5adf9dd9 Improve the description of sysctl "kern.sugid_coredump".
Submitted by:	Mel Flynn <mel.flynn+fbsd.hackers at mailing.thruhere.net>
		on -hackers
2009-10-12 15:49:48 +00:00
Konstantin Belousov
7df8f6ab6f Fix typo.
Submitted by:	rdivacky
MFC after:	1 month
2009-10-12 10:09:48 +00:00
Konstantin Belousov
6b286ee8b5 Currently, when signal is delivered to the process and there is a thread
not blocking the signal, signal is placed on the thread sigqueue. If
the selected thread is in kernel executing thr_exit() or sigprocmask()
syscalls, then signal might be not delivered to usermode for arbitrary
amount of time, and for exiting thread it is lost.

Put process-directed signals to the process queue unconditionally,
selecting the thread to deliver the signal only by the thread returning
to usermode, since only then the thread can handle delivery of signal
reliably. For exiting thread or thread that has blocked some signals,
check whether the newly blocked signal is queued for the process, and
try to find a thread to wakeup for delivery, in reschedule_signal(). For
exiting thread, assume that all signals are blocked.

Change cursig() and postsig() to look both into the thread and process
signal queues. When there is a signal that thread returning to usermode
could consume, TDF_NEEDSIGCHK flag is not neccessary set now. Do
unlocked read of p_siglist and p_pendingcnt to check for queued signals.

Note that thread that has a signal unblocked might get spurious wakeup
and EINTR from the interruptible system call now, due to the possibility
of being selected by reschedule_signals(), while other thread returned
to usermode earlier and removed the signal from process queue. This
should not cause compliance issues, since the thread has not blocked a
signal and thus should be ready to receive it anyway.

Reported by:	Justin Teller <justin.teller gmail com>
Reviewed by:	davidxu, jilles
MFC after:	1 month
2009-10-11 16:49:30 +00:00
Konstantin Belousov
6728dc169d Refine r195509, instead of checking that vnode type is VBAD, that is
set quite late in the revocation path, properly verify that vnode is
not doomed before calling VOP.

Reported and tested by:	Harald Schmalzbauer <h.schmalzbauer omnilan de>
MFC after:	3 days
2009-10-10 21:17:30 +00:00
Konstantin Belousov
ab02d85f0d Map PIE binaries at non-zero base address.
Discussed with:	bz
Reviewed by:	kan
Tested by:	bz (i386, amd64), bsam (linux)
MFC after:	some time
2009-10-10 15:33:01 +00:00
Konstantin Belousov
5b33842a9b Do not map segments of zero length.
Discussed with:	bz
Reviewed by:	kan
Tested by:	bz (i386, amd64), bsam (linux)
MFC after:	some time
2009-10-10 15:28:52 +00:00
Konstantin Belousov
47d81c1b70 Postpone dropping fp till both kq_global and kqueue mutexes are
unlocked. fdrop() closes file descriptor when reference count goes to
zero. Close method for vnodes locks the vnode, resulting in "sleepable
after non-sleepable". For pipes, pipe mutex is before kqueue lock,
causing LOR.

Reported and tested by:	pho
MFC after:	2 weeks
2009-10-10 14:56:34 +00:00
Robert Watson
604f19c91e Fix build on amd64, where sysctl arg1 is a pointer.
Reported by:	Mr Tinderbox
MFC after:	3 months
2009-10-05 22:23:12 +00:00
Edward Tomasz Napierala
d5e0b21541 Fix NFSv4 ACLs on sparc64. Turns out that fuword(9) fetches 64 bits
instead of sizeof(int), and on sparc64 that resulted in fetching wrong
value for acl_maxcnt, which in turn caused __acl_get_link(2) to fail
with EINVAL.

PR:		sparc64/139304
Submitted by:	Dmitry Afanasiev <KOT at MATPOCKuH.Ru>
2009-10-05 19:56:56 +00:00
Robert Watson
84d61770bc First cut at implementing SOCK_SEQPACKET support for UNIX (local) domain
sockets.  This allows for reliable bi-directional datagram communication
over UNIX domain sockets, in contrast to SOCK_DGRAM (M:N, unreliable) or
SOCK_STERAM (bi-directional bytestream).  Largely, this reuses existing
UNIX domain socket code.  This allows applications requiring record-
oriented semantics to do so reliably via local IPC.

Some implementation notes (also present in XXX comments):

- Currently we lack an sbappend variant able to do datagrams and control
  data without doing addresses, so we mark SOCK_SEQPACKET as PR_ADDR.
  Adding a new variant will solve this problem.

- UNIX domain sockets on FreeBSD provide back-pressure/flow control
  notification for stream sockets by manipulating the send socket
  buffer's size during pru_send and pru_rcvd.  This trick works less well
  for SOCK_SEQPACKET as sosend_generic() uses sb_hiwat not just to
  manage blocking, but also to determine maximum datagram size.  Fixing
  this requires rethinking how back-pressure is done for SOCK_SEQPACKET;
  in the mean time, it's possible to get EMSGSIZE when buffers fill,
  instead of blocking.

Discussed with:	benl
Reviewed by:	bz, rpaulo
MFC after:	3 months
Sponsored by:	Google
2009-10-05 14:49:16 +00:00
Attilio Rao
7f9f80ce03 When releasing a lockmgr held in shared way we need to use a write memory
barrier in order to avoid, on architectures which doesn't have strong
ordered writes, CPU instructions reordering.

Diagnosed by:	fabio
2009-10-03 15:02:55 +00:00
Bjoern A. Zeeb
925c8b5b04 Print a warning in case we cannot add more brandinfo because
we would overflow the MAX_BRANDS sized array.

Reviewed by:	kib
MFC After:	1 month
2009-10-03 10:50:00 +00:00
Robert Watson
afd8e45b45 Don't comment on stream socket handling in sosend_dgram, since that's
not handled.

MFC after:	3 weeks
2009-10-02 21:31:15 +00:00
Bjoern A. Zeeb
878adb8517 Add a mitigation feature that will prevent user mappings at
virtual address 0, limiting the ability to convert a kernel
NULL pointer dereference into a privilege escalation attack.

If the sysctl is set to 0 a newly started process will not be able
to map anything in the address range of the first page (0 to PAGE_SIZE).
This is the default. Already running processes are not affected by this.

You can either change the sysctl or the tunable from loader in case
you need to map at a virtual address of 0, for example when running
any of the extinct species of a set of a.out binaries, vm86 emulation, ..
In that case set security.bsd.map_at_zero="1".

Superseeds:		r197537
In collaboration with:	jhb, kib, alc
2009-10-02 17:48:51 +00:00
Ed Maste
e380cc73ed In fill_kinfo_thread, copy the thread's name into struct kinfo_proc even
if it is empty.  Otherwise the previous thread's name would remain in the
struct and then be reported for this thread.

Submitted by:	Ryan Stone
MFC after:	1 week
2009-10-01 21:44:30 +00:00
Edward Tomasz Napierala
2c29cfa083 Provide default implementation for VOP_ACCESS(9), so that filesystems which
want to provide VOP_ACCESSX(9) don't have to implement both.  Note that
this commit makes implementation of either of these two mandatory.

Reviewed by:	kib
2009-10-01 17:22:03 +00:00
Konstantin Belousov
75ffdc4049 Do not dereference vp->v_mount without holding vnode lock and checking
that the vnode is not reclaimed.

Noted by:	Igor Sysoev <is rambler-co ru>
MFC after:	1 week
2009-10-01 12:50:26 +00:00
Konstantin Belousov
15b7a831df Fix typo.
MFC after:	3 days
2009-10-01 12:46:58 +00:00
Andriy Gapon
7e936cef34 print machine in kernel boot version string
Discussed with:	gavin, kib, jhb
PR:		kern/126926
MFC after:	2 weeks
2009-10-01 10:53:12 +00:00
Attilio Rao
ddce63ca73 When releasing a read/shared lock we need to use a write memory barrier
in order to avoid, on architectures which doesn't have strong ordered
writes, CPU instructions reordering.

Diagnosed by:	fabio
Reviewed by:	jhb
Tested by:	Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
2009-09-30 13:26:31 +00:00
Andriy Gapon
78edb09e67 print_caddr_t: drop incorrect __unused attribute from parameter
seems like a purely cosmetic change

Reviewed by:	jhb, kib
MFC after:	1 week
2009-09-30 11:14:13 +00:00
Robert Watson
718dbcfeaa Regenerate system call files following r197636. 2009-09-30 08:48:59 +00:00
Robert Watson
72c35fc69c Reserve system call numbers for Capsicum security framework capabilities,
capability mode, and process descriptors: cap_new, cap_getrights, cap_enter,
cap_getmode, pdfork, pdkill, pdgetpid, and pdwait.

Obtained from:	TrustedBSD Project
Sponsored by:	Google
MFC after:	3 weeks
2009-09-30 08:46:01 +00:00
Jamie Gritton
d446857747 Set the prison in NFS anon and GSS SVC creds.
Reviewed by:	marcel
MFC after:	3 days
2009-09-28 18:07:16 +00:00