One last post before Christmas. Lately I’ve been seeing lots of Linux admins coming to Solaris boxes and getting confused – unable to find the right commands, unaware of the Solaris tools, generally just struggling to get to grips with an unfamiliar operating environment.
I rate Solaris very highly – it comes with some amazing tools, and is superbly tuned to the needs of the enterprise. This isn’t always aligned with the needs of developers and hackers, but once you get comfortable with the tools in Solaris, you’ll start to wonder how you managed without them.
I was going to type up a nice long document, but then I remembered that Ben Rockwood, over at Cuddletech, had already written An Accelerated Introduction to Solaris 10.
It’s a good read, and should help bring anyone with a Linux background up to speed. One thing – please, please take the time to read up on RBAC (linked from Ben’s post). RBAC is far more powerful than sudo, and you will find RBAC+LDAP an invaluable skill to have in the datacentre.
So, as covered in a previous post, Solaris Flash Archives give us a nice way to image a Solaris installation, and then to use that to build a machine via Jumpstart.
The process isn’t all one way, however, and sometimes you’ll want to have the ability to pull apart a flar and see what’s inside. Case in point: trying to debug some Jumpstart issues for a client, where some odd configuration was being set. It wasn’t being set during Jumpstart, and it wasn’t being set during the application install.
This just left the flar as being a possible culprit – but how to pull out a single file to check?
A Solaris Flash Archive is just a cpio archive, which means we can use the cpio command to play around with it. However, flars have some padding and extra sections – if you directly try to use cpio on it, you’ll get a lot of errors about ‘skipped XXX bytes of junk’.
We first need to pull apart the flar into archive, header, etc. sections – and we can do this directly with the flar command:
grond # cd /var/tmp
grond # mkdir flar_hacking
grond # cd flar_hacking/
grond # flar split /export/install/flars/sol9_0905_sun4u.flar
grond # ls -l
total 3907142
-rw-r--r-- 1 root root 1999449088 Dec 15 17:14 archive
-rw-r--r-- 1 root root 18 Dec 15 17:12 cookie
-rw-r--r-- 1 root root 461 Dec 15 17:12 identification
-rw-r--r-- 1 root root 4334 Dec 15 17:12 postdeployment
-rw-r--r-- 1 root root 1339 Dec 15 17:12 predeployment
-rw-r--r-- 1 root root 898 Dec 15 17:12 reboot
-rw-r--r-- 1 root root 53 Dec 15 17:12 summary
We can see that the flar split command has given us our archive, which is where all the files actually are, as well as the other extra sections which make the flar more than just a cpio archive.
Now that it’s split up, we can use cpio directly. In this case, I want to check to see if /etc/default/init is in the flar:
grond # cpio -it < archive | grep etc/default/init
etc/default/init
3905174 blocks
And there it is - so now we can use cpio again to extract the file:
grond # cpio -ivdm etc/default/init < archive
etc/default/init
cpio will extract the file relative to your working directory, not the root, so we won't be in danger of overwriting anything important:
grond # ls -lR etc/
etc/:
total 2
drwxr-xr-x 2 root root 512 Dec 15 17:15 default
etc/default:
total 2
-r-xr-xr-x 1 root sys 490 Oct 5 2007 init
And there's the file we wanted, extracted from the relevant Solaris flar. In this particular instance, it was indeed responsible for the bogus configuration being pushed out.
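The split, list and extract steps above could be wrapped up roughly as follows. This is a sketch only: flar_member_path and flar_extract are made-up helper names, the default scratch directory simply mirrors the one used above, and the grep pattern treats dots in the path as wildcards, which is loose but good enough for a quick check.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Solaris.

# The archive stores members without a leading slash (etc/default/init),
# so strip any leading slashes from the path we were given.
flar_member_path() {
    printf '%s\n' "$1" | sed 's|^/*||'
}

# usage: flar_extract /path/to/archive.flar /etc/default/init [workdir]
# Runs in a subshell so the caller's working directory is left alone.
flar_extract() (
    flar_file=$1
    member=$(flar_member_path "$2")
    workdir=${3:-/var/tmp/flar_hacking}

    mkdir -p "$workdir" && cd "$workdir" || exit 1
    flar split "$flar_file" || exit 1

    # Only extract if the member is actually listed in the archive.
    if cpio -it < archive 2>/dev/null | grep "^$member\$" >/dev/null; then
        cpio -ivdm "$member" < archive
    else
        echo "$member not found in $flar_file" >&2
        exit 1
    fi
)
```

As with the manual steps, the file lands under the scratch directory (for example /var/tmp/flar_hacking/etc/default/init), never under the real root.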
I came across this particular issue for a client, and it turned out to be a harsh gotcha in Solaris 9.
Quick recap: SVM metasets are a group of disks (usually from a SAN) that have their own meta state databases. They grew out of Sun Cluster as a way to share storage between cluster nodes, using SVM, and have since become a really handy way of managing SAN volumes.
Anyway, Solaris 9 4/04 introduced the ability to have ‘autotake’ metasets. Basically, one host was the master, and it could automatically import and manage the metaset on boot. This was great, because it finally swept aside the last baggage of Sun Cluster, and meant you could have your metasets referenced in /etc/vfstab and mount them at boot – just like real disks.
And there was much rejoicing across the land.
In this particular case, there was a host running Solaris 9 (for client software reasons) which had many terabytes of SAN LUNs mounted as metasets. I say had because when it rebooted, the machine said it couldn’t autotake the disk set because it wasn’t the owner, before dropping to single user mode complaining it couldn’t check any of the filesystems.
Odd. A quick check from single user mode, and yes indeed – the metaset was configured for autotake, but the host wasn’t the owner. Comment the (many) filesystems out of /etc/vfstab, continue the boot, and check again once at run level 3. Hang on – now the host is the metaset owner.
Whisky Tango Foxtrot, over. A quick Google threw up far too many suggestions to hack the startup scripts so that the SVM daemons start before the filesystem mounts. Not a great idea.
A very quick dig through Sunsolve turned up Sun BugID 6276747 – “Auto-take fails to work with fabric disks”
Turns out that this is an issue with the Solaris 9 SAN Foundation Suite, and how the kernel initialises SAN fabric LUNs, as opposed to FC-AL LUNs.
The fix is to add the following line to /etc/system:
set fcp:ssfcp_enable_auto_configuration = 1
One quick reboot later, and behold! The metasets are imported and mounted correctly, with no further problems. This appears to be purely an issue in Solaris 9, so apart from old client apps I’m hoping we can leave this one behind.
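If the same setting has to go onto several hosts, a small idempotent helper avoids adding the line twice. This is a minimal sketch: ensure_line is a made-up name, the .bak suffix is an arbitrary choice, and grep -F -x is assumed to be available (on Solaris that means the /usr/xpg4/bin version).

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if it is not already
# there, keeping a copy of the file first.
# usage: ensure_line /etc/system 'set fcp:ssfcp_enable_auto_configuration = 1'
ensure_line() {
    file=$1
    line=$2
    if grep -F -x "$line" "$file" >/dev/null 2>&1; then
        return 0                          # already present, nothing to do
    fi
    cp -p "$file" "$file.bak" || return 1 # keep the previous version
    printf '%s\n' "$line" >> "$file"
}
```

The change still only takes effect after a reboot, since /etc/system is read at boot.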
A fairly common problem with Solaris UFS filesystems is where df output is showing lots of free space, but you can’t actually write to the filesystem. Having recently been playing with multi-terabyte filesystems, and forcing these sorts of issues for debugging, I thought I’d share some information about the tools you can use and what they can report.
As an example, let’s look at a 2TB filesystem:
[root@gollum:/] # df -kh
Filesystem size used avail capacity Mounted on
/dev/dsk/c9t60060E80141189000001118900001400d0s0
1.9T 532G 1.4T 28% /fatty
The first thing we can do is not only check the amount of free disk space, but also check inode usage:
df -F ufs -o i
[root@gollum:/] # df -F ufs -o i
Filesystem iused ifree %iused Mounted on
/dev/dsk/c9t60060E80141189000001118900001400d0s0
2096192 0 100% /fatty
If we have multi-terabyte filesystems, our number of bytes per inode (nbpi) could be set too high if we’re using lots of small files – in which case it’s very easy to run out of inodes. We can see on this filesystem that we’ve used up all our inodes. Trying to write to this filesystem will result in “No space left on device” error messages – which is always good for some head scratching fun, as we can see that we’ve got 1.4TB of space free.
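To catch this before the writes start failing, the df output can be filtered for filesystems that are close to running out of inodes. A rough sketch, assuming the column layout shown above; inode_report is a made-up name, and on Solaris the awk call would need to be nawk (or /usr/xpg4/bin/awk) for -v to work.

```shell
#!/bin/sh
# Hypothetical helper, not a Solaris tool.
# Reads `df -F ufs -o i` output on stdin and prints every filesystem whose
# inode usage is at or above the given percentage (default 90).
inode_report() {
    awk -v limit="${1:-90}" '
        NR == 1 { next }      # header line
        NF < 4  { next }      # long device name wrapped onto its own line
        {
            # Count fields from the right so wrapped and unwrapped rows both work.
            pct = $(NF - 1)
            sub(/%/, "", pct)
            if (pct + 0 >= limit + 0)
                printf "%s: %d%% of inodes used, %d free\n", $NF, pct, $(NF - 2)
        }'
}

# Typical use:
#   df -F ufs -o i | inode_report 95
```

Fed the output above, this reports /fatty at 100% with 0 inodes free.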
To get an idea of how inodes, block size and things have been specified we need to find out how the filesystem was built:
/usr/sbin/mkfs -m <disk_device>
I’ve wrapped the line here to make it a bit more readable, but here’s the output querying our full multi-terabyte filesystem.
[root@gollum:/] # /usr/sbin/mkfs -m /dev/dsk/c9t60060E80141189000001118900001400d0s0
mkfs -F ufs -o nsect=128,ntrack=48,bsize=8192,fragsize=8192,cgsize=143,free=1,rps=1,nbpi=1161051, \
opt=t,apc=0,gap=0,nrpos=1,maxcontig=128 /dev/dsk/c9t60060E80141189000001118900001400d0s0 4110401456
This shows the mkfs command line that would recreate the filesystem, so we can see which parameters were specified when it was built.
Things we care about here are:
- fragsize – the smallest amount of disk space that can be allocated to a file. If we have loads of files smaller than 8kb, then this should be smaller than 8kb.
- nbpi – number of bytes per inode
- opt – how is filesystem performance being optimised? t means we’re optimising to spend the least time allocating blocks, and s means we’ll be minimising the space fragmentation on the disk
On a multi-terabyte filesystem, nbpi cannot be set to less than 1MB, and fragsize will also be set to bsize. So we’d want to optimise for time as opposed to fragments, as we’ll only ever allocate in 8KB blocks.
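When checking a number of filesystems it can be handy to pull individual values out of the mkfs -m output rather than reading the whole line. A small sketch, assuming the comma-separated -o format shown above; mkfs_opt is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one -o option from `mkfs -m`
# output read on stdin.
# usage: /usr/sbin/mkfs -m <disk_device> | mkfs_opt nbpi
mkfs_opt() {
    # Put every option on its own line (splitting on commas, spaces and the
    # backslash used for line wrapping), then keep the one we were asked for.
    tr ', \\' '\n\n\n' | sed -n "s/^$1=//p" | head -n 1
}
```

For the filesystem above, asking for nbpi, fragsize and opt would give 1161051, 8192 and t respectively.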
fstyp is the command we can use to do some really low-level querying of a UFS filesystem.
We can invoke it with:
fstyp -v <disk_device>
Make sure you pipe it through more, or redirect the output to a file, because there’s a lot of it. fstyp will report on the statistics of all the cylinder groups for a filesystem, but it’s really just the first section reported from the superblocks that we’re interested in.
[root@gollum:/] # fstyp -v /dev/dsk/c9t60060E80141189000001118900001400d0s0 | more
ufs
magic decade format dynamic time Fri Dec 5 17:26:27 2008
sblkno 2 cblkno 3 iblkno 4 dblkno 11
sbsize 8192 cgsize 8192 cgoffset 8 cgmask 0xffffffc0
ncg 4679 size 256900091 blocks 256857968
bsize 8192 shift 13 mask 0xffffe000
fsize 8192 shift 13 mask 0xffffe000
frag 1 shift 0 fsbtodb 4
minfree 1% maxbpg 2048 optim time
maxcontig 128 rotdelay 0ms rps 1
csaddr 11 cssize 81920 shift 9 mask 0xfffffe00
ntrak 48 nsect 128 spc 6144 ncyl 669011
cpg 143 bpg 54912 fpg 54912 ipg 448
nindir 2048 inopb 64 nspf 16
nbfree 187148663 ndir 2 nifree 0 nffree 0
cgrotor 462 fmod 0 ronly 0 logbno 23
version 1
fs_reclaim is not set
bsize and fsize show us the block and fragment size, respectively.
nbfree and nffree show us the number of free blocks and fragments, respectively. If nbfree is 0, you’re in trouble – no free blocks means no more writing to the filesystem, regardless of how much space is actually still available.
What usually happens when writing lots of small (ie. < 8kb) files to a filesystem is that the number of free blocks (nbfree) has fallen to 0, but you’ve got plenty of fragments left. If block size = fragment size, that’s not an issue – but if fragments are, say, 2kb, then you’re not going to be able to write to the filesystem any more (“file system full” error messages) even though df is showing lots of free disk space.
A big part of tuning your filesystem is knowing what’s going onto it. For multi-terabyte filesystems, you should be placing larger files on there – so setting block size to equal fragment size won’t be wasting space.
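The counters described above can be checked in one pass rather than by eye. This is a sketch under a few assumptions: ufs_diagnose is a made-up name, the field names are the ones in the output shown earlier, and the superblock summary is taken to be the first place those names appear (the per-cylinder-group sections come later).

```shell
#!/bin/sh
# Hypothetical helper: read `fstyp -v` output on stdin, keep the first
# nbfree/nifree/nffree values seen, and print a one-line verdict.
# usage: fstyp -v <disk_device> | ufs_diagnose
ufs_diagnose() {
    awk '
        {
            # Each line is a run of "name value" pairs; remember the first
            # value seen for the three counters we care about.
            for (i = 1; i < NF; i++)
                if (($i == "nbfree" || $i == "nifree" || $i == "nffree") && !($i in val))
                    val[$i] = $(i + 1)
        }
        END {
            if (!("nbfree" in val) || !("nifree" in val)) {
                print "no superblock counters found"
            } else if (val["nifree"] + 0 == 0) {
                printf "out of inodes: nifree=0, nbfree=%d\n", val["nbfree"]
            } else if (val["nbfree"] + 0 == 0) {
                printf "out of free blocks: nbfree=0, nffree=%d\n", val["nffree"]
            } else {
                printf "ok: nbfree=%d, nifree=%d\n", val["nbfree"], val["nifree"]
            }
        }'
}
```

On the filesystem above this would report the inode exhaustion, since nifree is 0 while nbfree is still in the millions.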
If you’ve got lots of smaller files, you’ll need to think about what the average filesize is – if it’s less than 8kb, you’ll want to make sure that fragment size is also less than 8kb. Otherwise you’ll be wasting space by writing 8kb blocks all the time when you could get away with 2kb fragments.
Anyway, back to the problem at hand – our 2TB filesystem that’s run out of inodes. In this particular case, we’ll need to rebuild the filesystem and allocate more inodes. The question is – how do we work out what the value should be?
This simple shell script will analyse the files from the directory you execute it in, and will come back with the average file size:
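#!/bin/sh
# List every regular file under the current directory with its size,
# then print the total size and the average file size.
# Note: $9 only holds the whole name when file names contain no spaces.
find . -type f -exec ls -l {} \; | \
awk 'BEGIN { tsize = 0; fcnt = 0 }
{ fcnt++
  printf("%03d File: %-60s size: %.0f bytes\n", fcnt, $9, $5)
  tsize += $5 }
END { # %.0f rather than %d, so totals above 2GB are not clamped
  if (fcnt > 0)
    printf("Total size = %.0f Average file size = %.2f\n", tsize, tsize / fcnt) }'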
Running it we can see:
(lots of output)
....
Total size = 2147483647 Average file size = 258286.18
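Now, if our average file size is 252KB, then our inode density of 1161051 (roughly 1 inode per 1MB) is going to be hopelessly inadequate. This is borne out by looking again at our df output – we can see that we’ve run out of inodes when the filesystem is only approximately a quarter full, which matches up to our average file size being roughly a quarter of the bytes-per-inode value. (The total size in that output looks to be pinned at 2147483647, the 32-bit signed maximum, because it was printed with %d; the average is calculated in floating point and isn’t affected.)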
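The quarter-full observation can be checked with a line of arithmetic: if every file takes one inode, the inodes run out once the data fills roughly (average file size / nbpi) of the space. A back-of-envelope sketch, ignoring directories and filesystem overhead; inode_fill_pct is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper.
# usage: inode_fill_pct AVERAGE_FILE_SIZE_BYTES NBPI
# Prints the percentage of the data space that can be filled before the
# inodes run out, assuming one inode per file.
inode_fill_pct() {
    awk -v avg="$1" -v nbpi="$2" 'BEGIN {
        pct = avg / nbpi * 100
        if (pct > 100) pct = 100   # files bigger than nbpi fill the disk first
        printf "%.1f\n", pct
    }'
}

inode_fill_pct 258286.18 1161051    # prints 22.2
```

That 22% is in the same ballpark as the 28% capacity df reported when the inodes ran out.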
However, at this point, we’re stuffed – we can’t set nbpi to be less than 1MB on a Solaris UFS filesystem that’s larger than 1TB. Our only options are:
- chop the filesystem up into smaller ones
- migrate to ZFS
- create bigger files ;-)
A constant problem when people write scripts is that you end up with loads of different log files scattered across the file system. This brings with it the associated pain of parsing the log files, archiving the old ones, etc. etc.
Wouldn’t it be great if you could get your scripts to log to syslog? Enter logger, which is present on pretty much all UNIX systems.
At the top of your script, after defining the Korn shell (you are writing in Korn, aren’t you? You do expect your scripts to work across more than one platform, don’t you?) you can add the simple construct:
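# start logger as a coprocess, tagged with the script name and PID
logger -p daemon.notice -t "${0##*/}[$$]" |&
# send stdout and stderr to the coprocess from here on
exec >&p 2>&1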
And behold! Magical script entries in syslog – in this example, from a script called test_script running on an Origin 200 called frith:
Nov 25 17:40:41 frith test_script[17449]: [ID 702911 daemon.notice] scripty logging goodness
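The coprocess approach logs everything at a single priority. Where a script needs to log errors at a different priority from routine messages, a per-message wrapper is one alternative. A minimal sketch, using only the -p and -t options shown above; log_tag and log_msg are made-up names.

```shell
#!/bin/sh
# Hypothetical helpers.

# Build the same "scriptname[pid]" tag used above from a path and a PID.
log_tag() {
    printf '%s[%s]' "${1##*/}" "$2"
}

# usage: log_msg PRIORITY message words...
log_msg() {
    prio=$1
    shift
    logger -p "$prio" -t "$(log_tag "$0" "$$")" "$*"
}

# Example calls (commented out so that sourcing this file logs nothing):
#   log_msg daemon.notice "starting nightly copy"
#   log_msg daemon.err "nightly copy failed"
```

The trade-off is that output from commands the script runs is no longer captured automatically, so the two techniques suit different scripts.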
The IRIX manpage for logger says:
Logger provides a shell command interface to the syslog(3B) system log
routine. It can log a message specified on the command line, from a
specified file, or from the standard input. Each line in the specified
file or standard input is logged separately.
The Solaris manpage is a bit more verbose:
The logger command provides a method for adding one-line
entries to the system log file from the command line. One or
more message arguments can be given on the command line, in
which case each is logged immediately. If this is unspecified,
either the file indicated with -f or the standard input is
added to the log. Otherwise, a file can be specified, in which
case each line in the file is logged. If neither is specified,
logger reads and logs messages on a line-by-line basis from
the standard input.
However the important thing is that logger takes the same key options and works in the same way – giving you a simple, portable way to get syslog entries from your custom scripts, cross platform.