DHCP and djbdns

July 17th, 2008

DHCP assigns IPs to hosts; DNS reports IPs by hostname. The popular dhcp server is even written by the folks who brought you BIND.

But in our shop, we run djbdns.

I am not a full convert to the djb way. (We run postfix, for example.). I have never met the man in person. I would probably find him infuriating. But I find his claims about BIND and DNS convincing. And I’ve been using his software for the last eight years without a major glitch.

I selected djbdns when I was setting up DNS for my first real website. I had bought the DNS and BIND O’Reilly book and was puzzling through the man pages and just feeling stupider and stupider the more time I spent on it. I figured there must be a better way, so I looked around and found djbdns. The setup instructions were clear and simple. I followed them and everything worked. I felt smart and competent.

Users like software that makes them feel good. There’s a lesson for us all there, I suspect.

Anyway, one of the quirks of djbdns is it doesn’t honor UPDATE requests. Unsecure, you see. Which makes it hard to do DHCP….

Reporting Your Own CPU Time on Windows

July 12th, 2008

Unix has the ‘time’ command to report the running time of a process. Actually there’s both the time command, /bin/time or /usr/bin/time, provided by time-1.7 on my CentOS server, and the bash builtin command time. They have slightly different output formats but they both report the run time (aka wall clock time) and CPU time in both user mode and kernel mode:


[smikes@arthur ~]$ time ls >/dev/null

real    0m0.055s
user    0m0.000s
sys     0m0.004s
[smikes@arthur ~]$ /usr/bin/time ls >/dev/null
0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+253minor)pagefaults 0swaps
[smikes@arthur ~]$

Windows? No native equivalent. If you install Cygwin (which you should) then you get /usr/bin/time and bash-time as above. That’s handy for you, but unless you can guarantee your users will have Cygwin installed too, not something you can rely on.

Ages ago I had put in some trivial timing code into $CLIENT’s main application. It was just based on recording time() at program start and successful exit and reporting the run time in hours, minutes, and seconds from that. Some time later while my brain was thoroughly disengaged, I added another counter based on clock_t and, to distinguish it from the first one, I reported that as “CPU time”.

Now it’s several years later and I get the following exceedingly polite bug report:

Consider these two successive builds of ___, one without the error files and one with. The error files are significant, but not that significant.

ST:Total CPU Time: 1:39:12.42
ST:Total Wall Time: 1:39:14.

Second run, when I got the computer much busier doing test builds.

ST:Total CPU Time: 2:16:23.08
ST:Total Wall Time: 2:16:25.

The reported CPU time should not be significantly higher than the first run, but it is suspiciously close to 100% of the Wall Time. As I used a significant amount of CPU time on other processes, I know the reported CPU tim [sic] cannot be correct.

Busted! I must have been thinking that clock_t was magically going to give me a CPU time number since I associate clock_t with timing loops. But over the entire run time of the application, of course it will record all the time, both execution time and wait time.

After poking around for a while I rediscovered WMI, the Windows Management and Instrumentation service. Among other things it keeps per-process counters for kernel mode and user mode execution time. It comes with a wealth of baroque interfaces, including COM (of course), an interactive/batch command-line client (wmic.exe), and others. I didn’t want to add any more COM to this C application, so decided to try the command-line client first.

Here’s my first spike solution:


#include <stdio.h>
#include <process.h>

int main()
{
  FILE * pf = NULL;

  pf = _popen( "wmic process where Handle=5612 list FULL", "r" );

  while( !feof( pf ) )
  {
    char acLine[512];
    fgets( acLine, 512, pf );
    if( strncmp( acLine, “KernelModeTime”, 14 ) == 0 )
    {
       printf( “Kernel mode time: %.2f s\n”, atof(acLine+15)/1.0e6 );
    }
  }
}

Note that I’m just hardcoding a process ID in (that happened to be my MSDEV.exe process ID at that time, actually) and reading into a static fixed-size buffer. The wmic command line was developed by repeated hacking around with wmic /? and some judicious googling.

When I compiled this with MSVC (cl test.c => test.exe) and ran it under cygwin, I got a hard hang of my rxvt. But running it under cmd.exe worked fine, yielding a reasonable-enough looking number. Encouraged, I proceeded to integrate it into the application.

My first version was quite similar to the spike above, but with a longer buffer. This worked fine but gave nonsense results, reporting 222 seconds in user time when the total execution time was only 85 seconds. Apparently the WMI data is reported in tenths of microseconds instead of microseconds. The older documentation I initially found gave milliseconds as the unit for the script object interface, but more recent documentation says 100-ns units, as I found. Apparently the reasonable-seeming result I saw in my spike solution was ten times too high.

Then I decided to do away with the fgets(), static buffer, and strncmp and switched to reading into malloc’ed storage and using strtok(). That was all good and it ran beautifully in the debugger. But when I started the program from a cygwin command-line, I again got a hard hang, uninterruptible by Ctrl-C, leading to an rxvt crash.

It turns out that Windows _popen() and Cygwin just do not play nicely together. Furthermore, any moderately advanced Cygwin console (e.g., an rxvt) uses a pty device instead of the raw Windows console device, and ptys are confusing to the wmic command-line client. You can find the source material yourself on Google: http://www.google.com/search?q=wmic+cygwin and http://www.google.com/search?q=_popen+cygwin.

So I went to the age-old ugly solution of writing to a temporary file. My application runs in a dedicated directory and assumes it has exclusive control of the current directory, so I don’t protect against race conditions. But then I discovered that WMIC writes UTF-16 output which Windows had been recoding to my local code when I used _popen(). When dumped to a file, Windows left it alone and I became responsible for recoding it myself (or using it as-is). So I switched to the ‘wcs’ versions of the string functions I was using, and used our utility function read_lineW to read a line from the file into malloc’ed storage.

I think you need administrator access to read WMI performance counters (or maybe just to run WMIC?), so admin access is necessary. Since it depends on hardcoded localizable constants (UserModeTime etc.) it probably won’t run on a localized Windows where wmic.exe displays BenutzerModeTime or whatever. But it works for my purposes, and maybe for yours too:

Here’s the final code:



char * format_hmsd( double dSeconds )
{
	static char acBuffer[32] = {0};

	int nSeconds = (int)dSeconds;
	int nMinutes = nSeconds / 60;
	double dFrac = dSeconds - (nMinutes * 60);

	int nHours = nMinutes / 60;
	int nMinRemainder = nMinutes % 60;

	sprintf( acBuffer, “%2d:%02d:%05.2f”, nHours, nMinRemainder, dFrac );

	return acBuffer;
}

void wmi_time( double * pdKernel, double * pdUser )
{
	FILE * pf = NULL;
	char acFile[32] = {0};
	char acCmd[128] = {0};
	wchar_t * psLine = NULL;
	DWORD nID = GetCurrentProcessId();

	*pdKernel = *pdUser = 0.0;

	sprintf( acFile, “wmi%04d.out”, nID );
	sprintf( acCmd, “wmic process where handle=%d list FULL %s”, nID, acFile );
	system( acCmd );

	pf = fopen( acFile, “rb” );

	for( psLine = read_lineW( pf ); psLine != NULL; free( psLine ), psLine = read_lineW( pf ) )
	{
		wchar_t * psKey = wcstok( psLine, L”=” );
		wchar_t * psValue = wcstok( NULL, L”" );

		if( psKey != NULL )
		{
			if( wcscmp( psKey, L”KernelModeTime” ) == 0 )
				*pdKernel = wcstod( psValue, NULL ) / 1.0e7;
			else if( wcscmp( psKey, L”UserModeTime” ) == 0 )
				*pdUser = wcstod( psValue, NULL ) / 1.0e7;
		}

	}

	fclose( pf );
	remove( acFile );
}

/* Used like this:
  double dKernel, dUser;
  double dTotal = (clock() - clkStart+0.0)/CLOCKS_PER_SEC;
  wmi_time( &dKernel, &dUser );

  logmsg( “ST:Total Wall Time  : %s.\n”, format_hmsd( dSeconds ) );
  if( dUser != 0.0 || dKernel != 0.0 )
  {
	logmsg( “ST:Total I/O Time   : %s.\n”, format_hmsd( dSeconds - dUser - dKernel ) );
	logmsg( “ST:Total CPU Time   : %s.\n”, format_hmsd( dKernel+dUser ) );
	logmsg( “ST:Total Kernel Time: %s.\n”, format_hmsd( dKernel ) );
	logmsg( “ST:Total User Time  : %s.\n”, format_hmsd( dUser ) );
}
 */

Does this matrix make me look sheared?

July 9th, 2008

A customer recently had a problem with the way we were drawing his circles. They were coming out looking like weird wankel rotator things.

The problem turned out to be a spline thing: stress management and periodicity. But before sorting that out I went down this long rabbit hole about transformation matrices, and figuring out whether a given matrix contains a shear component.

You can go down the rabbit hole yourself if you want to: just pull out your linear algebra textbook, and spend some quality time at Wikipedia’s extensive math pages, but it all comes back to a very simple question: if I take two vectors that are perpendicular before I transform them, will they still be perpendicular after I transform them? Check for each pair of vectors in a basis set, and you’re done.

So I think the check for shear is (T * x) dot (T * y) != 0 => implies shear. Seem sensible enough?

Why does this matter? Tune in another time.

Why you shouldn’t get an accountant

July 6th, 2008

Often people who are starting self-employment are advised to get an accountant. This is usually overkill.

First let’s look at what an accountant is. An accountant is a professional or quasi-professional who is skilled at auditing, determining the costs of things, and assessing the value of different business strategies. Given enough data and time, an accountant can tell you how much it costs per day to keep a pallet of Cascade on a shelf in a warehouse, whether you’re selling enough Cascade to justify the inventory you’re keeping, and how much high you’d need to price off-brand dishwashing soap in order to get an acceptable return on capital. Accounting is not particularly dry or boring, in my very limited experience with it.

But since as a software solo practitioner, you’re not managing inventories, have very low captial requirements, and relatively simple expenses, you don’t have much need for the higher-level abilities of an accountant. All you really need is to keep track of your income and expenses, generate a couple of reports at your year-end, and file minimal tax forms with the state/provincial and federal governments. Hiring an accountant to do this is like hiring a Jamie Zawinski to ‘program HTML’.

All you really need is a bookkeeper. So let’s look at the typical transactions after a year of software self-employment: you’ll write a dozen salary checks, deposit checks from at most 2-3 clients per month, and reimburse yourself for some expenses. You will depreciate a computer and maybe your desk and chair. That’s at most 100 transactions per year. For many people it will be more like fifty. Even if you save up all your bookkeeping and do it at the end of the year, it shouldn’t take you more than an hour. (I just looked at my last year’s books and we did 111 transactions, including maintaining bank accounts in two currencies and two sets of monthly payroll checks.)

So: why should you do your own bookkeeping? I’m sure it feels nice when you hand over your mess of papers and receipts and check stubs to somebody else and get back a neatly printed tax return. It feels less nice when you reflect that you’re paying at least $1,000 for work that you could have done in an hour or two — and if you spread the work out over the year, as I recommend, you don’t even notice the minute here and minute there that you spend on it.

Incidentally, $1,000 for a simple business tax return might be obsolete. After the Enron/Andersen mess, accountants got slapped with a pile of new regulation which, as far as I can tell, makes them more liable for failing to detect and document your mistakes, evasions, etc. This has the valuable effect of making accountants more careful and meticulous when they prepare books for large publicly traded companies. It has the irritating effect of making accountants treat everybody as a potential Enron, and also inspiring accountants to carry professional liability insurance, which of course increases their rates.

The other thing you can get from an accountant — auditing — is not particularly useful to you as a solo practitioner. If you’re the only employee and the boss and the shareholders, you can only steal from yourself. (Arguably you could be trying to steal money from the government by cheating on your taxes, but it’s not really the accountant’s job to prevent that.)

Directions this will eventually go: how to set up your bookkeeping, some more discussion of the accountability (or lack thereof) including responding to some of the points raised by Ben in his comments about startups vs. solo practice.

Use STL lower_bound even though it’s slower

July 3rd, 2008

Summarizing the last few posts on STL lower_bound: instead of falling back to a naive algorithm when it’s not given a random-access iterator, STL’s lower_bound algorithm still performs binary search using bidirectional or forward-only iterators. This causes it to perform 2N list traversals, which is four times as many as the naive algorithm’s average case of N/2 list traversals. The STL algorithm guarantees that there will only be O( log N ) comparisons, but we’ve seen that’s not much of an advantage unless the comparison requires an extra memory access.

You might be thinking: why would anybody ever do lower_bound on a list anyway? Just maintain the data in a vector: that has O(1) amortized append, and the STL lower_bound wil run in O( log n ). Remember that the disadvantage of a vector is not merely the O(n) cost of inserting/deleting in the middle: adding an element to a vector invalidates all of its iterators because the addition can, in principle, realloc.

So given that, for lists, naive_lower_bound is clearer, shorter, and faster for large N even when comparisons are costly, when should you use it instead of STL lower_bound? Never. Always use STL lower_bound. The maintenance costs of naive_lower_bound are too high. If you put it in your code, even if you comment it clearly and provide a link to this discussion, in six months or two or ten years, some maintenance programmer is going to be looking at that line of code and wondering “What was wrong with STL lower_bound? Is it still true that naive_lower_bound is fast? Is it even correct?” And that maintenance programmer might well be you. It’s like the old adage: Nobody ever got fired for buying IBM. Or Microsoft. Or today, STL.

Of course, like all generalizations, this one is partially false. If your program is the client for a popular MMORPG and you offer your users a discount if they let you run numerical algorithms on their machine while the client is running, and you need to maintain a linked list of all other available clients, sorted by latency or some other simple metric of speed/quality, and frequently partition that list using lower_bound periodically, then perhaps it makes sense for you to avoid STL lower_bound. But maybe you could use a vector instead. Or maybe instead of maintaining a single sorted list, you could decide a priori to divide your clients into slow, medium, and fast links based on latency < 50ms, < 200 ms, over 200 ms. There are many available ways to approach the problem, and improving the algorithm’s running time — although it’s in some ways the easiest to measure and implement — carries with it maintenance costs that you won’t have to pay.

STL lower_bound is better for complex keys, but still loses on large N

July 2nd, 2008

Testing a list<int> is pretty much the worst possible case for STL’s lower bound algorithm. Recall from the analysis that STL’s lower_bound uses a binary search algorithm even when the cost of advancing an iterator N steps is O(N). The advantage of this algorithm is that it only performs O( log N ) comparisons, as opposed to O( N ) comparisons for the naive algorithm.

Testing STL lower_bound against naive_lower_bound using integers definitely plays to the strengths of naive_lower_bound: an integer compare is a single machine instruction. To test the effects of a more-costly comparison, I used the simplest costly key I could think of: a string consisting of a prefix of underscores, followed by the sequence number formatted in decimal. I ran with one, ten, and 64 underscores; 64 is the cache line length on my machine, so this length forces a comparison to make four memory accesses (two per argument) instead of only two.

For these tests, the STL lower_bound is initially twice as fast as the naive_lower_bound, and I must say it’s a bit of a relief to see the standard library algorithm outperforming the dumb algorithm. But as the cache effects start to be felt (for me, around 30,000 elements) the sheer cost of traversing the whole list twice wipes out the savings from reduced comparisons, at least for 1 and 10 underscores.

Recall from the earlier analysis that the naive algorithm runs in N * (c + t) / 2 for the average case, where c is the average cost of a comparison and t is the average cost of a list traversal. So the overall slope of the naive algorithm, for large N, should be (c + t) / 2. In contrast, the STL algorithm runs in 2 * t * N + c * lg N. Just for coarse analysis, we can throw away the c * lg N term (lg N is effectively a constant, varying between 10 and 16 when N goes from 1,000 to 60,000), which gives 2 * t for the slope of the STL algorithm. The difference between the expected slopes is (3t - c)/2.

With this in mind, you can see that even when there’s a substantial constant prefix on the key (ten underscores) this doesn’t appreciably raise the cost of a comparison, at least not compared to the cost of three list traversals. So let’s try forcing two more memory accesses per comparion by extending the prefix length to 64 bytes, the full length of a cache line.

64 bytes of underscores cause each comparison to make 4 memory accesses, slowing naive substantially

Now the cost of a comparison is much closer to the cost of three list traversals, but still slightly less: the graph shows that the execution time of STL is still increasing more quickly with increasing N. The second memory access of the comparison doesn’t suffer the full random-access penalty of a true cache miss because the key is contiguous and the cache reads ahead, while for the list traversal it’s quite unlikely that any two adjacent list elements would be within 64 bytes of each other.

Overall, though, for most small-to-medium list sizes the STL algorithm is winning when the key is more complex than a builtin type.

Next time: why this matters, when to use the STL lower_bound, and more.

STL lower_bound on lists four times slower than naive algorithm

June 29th, 2008

Here is a comparison between the naive lower bound and STL’s version. Execution time reported in seconds was measured using the Windows high resolution timer. An STL list of integers 1-N was created by populating a vector, shuffling it, pushing the elements onto a list, and sorting the list. This was done so the list elements would not be allocated sequentially and thus benefit from cache locality.

For each N, 200 evenly spaced numbers in the range 1..N were selected. To these were added 0 and N+1, so that boundary conditions would be tested. The test values were shuffled to disrupt cache effects. A preliminary test was done to ensure that both algorithms got the same results for all test values. Then the lower_bound operation was timed, for each test value, repeated ten times.

Naive lower_bound about twice as fast as STL without optimization

As expected, the cost of memory access dominates the cost of the comparison, so the STL algorithm’s guarantee of fewer comparisons (at the cost of a guaranteed 2N memory accesses) turns out to be a bad trade. Note the knee in the graph, seen most clearly in the relative times (yellow triangles). I believe what happens there is the entire data set no longer fits in cache, and the STL algorithm suffers disproportionately from this because its first step — calculating the length of the list — causes the first elements to spill from cache as the last elements are being traversed. Then when it goes back to the beginning and starts its binary search, the first elements need to be reloaded from cache.

The results for a build with optimization turned on are similar:

With optimization naive is 4 times as fast as STL.

In this case the Naive algorithm is consistently 4 times faster than STL, and you can clearly see the knee in the STL graph around the first tick mark — effectively there are two linear portions of the graph. Interestingly, both STL and Naive suffer similarly from cache effects, as the relative timings stay constant as we pass the knee and get to the cache-limited region. Also (though you can’t see it because I left off the values of N), the numeric value of N where the knee occurs is larger in the optimized version (45,000 vs. 20,000), probably reflecting the overhead that the debug memory allocator attaches to each block.

Visual Studio 6.0 list::sort erases some elements when more than 32767 are present

June 28th, 2008

Update: This is a known bug.

In the process of gathering my data for the next post in the stl lower_bound() series, I ran into some weird timings. Basically I was seeing a linear increase until there were 32000 elements in my list, and then at 33000 the timing suddenly dropped by a factor of over 10.

I traced it back to a bug in MSVC 6.0 list::sort. I don’t know what the mechanics of the bug are, and I haven’t confirmed it widely, so I’d appreciate some feedback here. The two machines I’ve tried it on were running MSVC6.0 SP6 and had the Microsoft Platfrom SDK from February 2003 installed as well.

The repro code is below. Create a new Visual C++ Win32 console project, create an empty project, add file main.cpp, and compile. I get an assertion failure on line 34, following the second list.sort(). That is: list.sort() succeeds when there are 32767 elements in the list, but when the 32768th is added, the sort algorithm seems to be deleting 32768 elements.


#include <list>
#include <algorithm>

#include <assert.h>

using namespace std;

struct counter
{
  typedef int result_type;

  counter() : n(0) {}
  result_type operator()() { return n++; }

  result_type n;
};

int main()
{
  list<int> l;

  generate_n( back_inserter( l ), 32767, counter() );

  assert( l.size() == 32767 );
  l.sort();
  assert( l.size() == 32767 );

  l.clear();

  generate_n( back_inserter( l ), 32768, counter() );

  assert( l.size() == 32768 );
  l.sort();
  assert( l.size() == 32768 );

  l.clear();

  return 0;
}

If anyone else can reproduce this with MSVC6.0 SP6, please let me know.

Analyzing STL lower_bound: why does it iterate the whole list twice?

June 27th, 2008

I criticized an STL algorithm in a previous entry, calling it “unintuitive but correct.”. That’s pretty arrogant of me, considering the STL has been published for fifteen or so years. You’d think that if an STL algorithm were incorrect, someone would have noticed by now. Or taken differently, STL doesn’t need an obscure blogger’s approval.

Anyway, we’re here to criticize algorithms here, not listen to me moan.

Here’s the naive algorithm (in C++, yet! And corrected to *cough* test the termination condition):


template< class iter, class val >
inline iter naive_lower_bound( iter l, iter u, const val& v)
{
  iter r = l;

  while( l != u && *l <= v )
  {
    r = l++;
  }

  return r;
}

Here’s the STL lower_bound algorithm from the latest version of SGI STL. (That would be v3.3 released in 2000):


template <class _ForwardIter, class _Tp, class _Distance>
_ForwardIter __lower_bound(_ForwardIter __first, _ForwardIter __last,
const _Tp& __val, _Distance*)
{
 1  _Distance __len = 0;
 2  distance(__first, __last, __len);
 3  _Distance __half;
 4  _ForwardIter __middle;
 5
 6  while (__len > 0) {
 7    __half = __len >> 1;
 8    __middle = __first;
 9    advance(__middle, __half);
10    if (*__middle < __val) {
11      __first = __middle;
12      ++__first;
13      __len = __len - __half - 1;
14    }
15    else
16      __len = __half;
17  }
18  return __first;
}

You can tell it’s an STL algorithm because it’s elegant, graceful, and full of $%! underscores. (Did Cfront need that horrible convention?)

So first let’s look at line 2. Calculating the distance() between two iterators is O(1) for a random access iterator, but O(n) for all other iterator types. The naive algorithm doesn’t calculate the distance at all.

Next think about the case where the correct lower bound is element l. The naive algorithm finds this in two list accesses: when *(l+1) is greater than v, it terminates and returns. STL walks all the way to the halfway element and inspects it first. It’s way too big, so we reduce len and partition again. We actually wind up traversing N elements to get the answer.

For the middle case or end case, a similar analysis shows that STL runs in Ω(n), about 2*N*t + C * lg N, where t and C are the average traversal and comparison costs. In contrast, the naive approach runs in an expected N*(C+t)/2 time. In the very worst case, drop the two.

So what’s the big deal? Both algorithms are O(n), differing only in a constant factor. Besides, the STL one is documented to use O(n) time but only do O(lg n) comparisons — a feature the naive algorithm notably lacks. So what’s the big deal?

The problem, in a word: caching. Next time: data!

Bleg: Comment auto-approval

June 27th, 2008

I currently have to manually approve every commentor the first time they comment (and every time there’s a new email address, WordPress thinks it’s a different person). That sucks for me and the commentors, and especially when I’m away from the keyboard for long stretches of time, it kills the usefulness of the comment area.

So is Akismet enough? Can I just turn off comment moderation? Or is there some WordPress plugin that helps with this? Should I be encouraging people to register by promising to sell their email addresses to Uzbekistani peddlers of bovine porn? Err…. I mean, promising to never sell their email addresses….

Anyway, I’m going to try turning off moderation for a while and see what happens. Akismet is still on. Suggestions in the comment box or by email to $LEFT( “sambal.org”, 3 ) + “@sambal.org”.