Caterwaul 1.1 in beta

The 1.0.x series has been stable for a month or two, and now I’m actively developing Caterwaul 1.1. Major changes include:

  • A much more structured macro system (no more arrays of macroexpanders) that is extremely performant.
  • Removal of a lot of redundancy from the standard library. This will break code written for 1.0.x, but the conversion is straightforward.
  • Sequence library enhancements (of course).
  • Removal of the precompiler. I did this partially out of laziness, but partially because precompilation was a hack to work around the performance problems of the macroexpander. Now that macroexpansion is fast, I’m not worried about it.
  • Removal of tracing. This should be a separate project, since it has nothing to do with everyday Caterwaul programming.

Should be an exciting few months as the 1.1 releases become stable. I’m also shooting for CoffeeScript integration for 1.2 or 1.3.

Zooming out

We never wonder how two buildings are connected to one another. We may not know the details, but we can always see the big picture by walking far enough away to take in the whole scene.

Why can’t we do the same with software?

Imagine 100,000 lines of code. Who knows how they connect? What does the call graph look like? How coupled is a module? If the language is statically typed, you have some information about this, though it may still not be obvious whether it’s safe to remove something.

We’re always looking at the details with code because we don’t have anything like perspective projection. Humans would be similarly handicapped if our eyes focused in a narrow beam instead of fanning out.

There’s a heuristic assumed here: Things far away are less important than things up close. Eyesight with perspective projection is a brilliant user interface to capture that. It isn’t perfect, of course; sometimes there is action at a distance, and that violates the assumption of locality. But in general we’d much rather see things the way we do than be constrained to orthographic perspective.

Caterwaul 1.0 beta!

Caterwaul has come a long way in the past few weeks, and last weekend I set up the main Caterwaul page. Check it out, let me know if you run into problems, and expect the official 1.0 release in the coming weeks!

How the Internet tracks you

Here are some things that websites can do to bypass traditional privacy guards put in place by browsers. I wasn’t aware of most of these until I started contracting for an advertising company.

  1. Install a “zombie cookie.” This cookie is designed to last forever, even after you clear your cache, history, and cookies. It can do this because it’s stored in Flash, HTML5 local storage, Java, and maybe other places, and it repopulates every storage location if it is found in any one of them. (Though see below for a way to get rid of them.) Because of this last feature, the cookie propagates to all browsers installed on the same system if they share Flash/Java data (which generally happens if they’re on the same user account).
  2. Bypass the cross-domain AJAX restriction. There are a few well-known techniques to do this including JSONP and using an <img> tag to request a URL. This is significant because a poorly written website can be vulnerable to scripts which send form data to a third party without your knowledge.
  3. Bypass the cross-frame restriction using either HTML5 postMessage or a hack involving document.location.hash (which one frame can use to write to another). This is significant because it means that advertisers can potentially read content (and possibly passwords, if they’re unscrupulous) from the parent page and associate it with your ID, even when placed inside an iframe.
  4. Query for particular URLs in your history. This is an interesting CSS trick; here’s the template:
    a.link1 {background-image: url('http://server.com/url1-not-visited')}
    a.link1:visited {background-image: url('http://server.com/url1-visited')}
    Then in the HTML document, this link will query the URL:
    <a class="link1" href="http://website.com"></a>
    If the user has visited the website, a request will be made to http://server.com/url1-visited; otherwise a request will be made to http://server.com/url1-not-visited. And, of course, Javascript allows a page to query many URLs quickly, refining its search pattern as it finds more matches.
  5. Buy data from a third party. One of these parties is a company called Rapleaf, which maintains, probably through zombie cookies, a database of user identities and information about them. They can ascertain with some accuracy your age, sex, interests, and possibly more.
  6. Determine your position to within a few meters. This is a new HTML5 feature (and properly designed so that you don’t enable it by accident). The idea is that your browser looks for local wifi networks with known locations, then triangulates the location of your computer based on relative signal strengths.
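
To make the first item concrete, here is a minimal sketch of the respawn logic, which is also why clearing a single store doesn’t get rid of the identifier. The stores are plain objects standing in for cookies, Flash storage, and so on; the names respawn, cookieJar, flashStore, localStore, and the uid key are mine, made up for illustration, and not taken from any real tracking script.

```javascript
// Hypothetical stand-ins for separate storage locations (cookie jar, Flash,
// HTML5 local storage, ...). Real ones have different APIs; these are objects.
var cookieJar  = {uid: 'abc123'};
var flashStore = {uid: 'abc123'};
var localStore = {uid: 'abc123'};

// If any store still holds the value, copy it back into all of them.
var respawn = function (stores, key) {
  var found;
  for (var i = 0; i < stores.length; ++i)
    if (stores[i][key] !== undefined) { found = stores[i][key]; break; }
  if (found !== undefined)
    for (var j = 0; j < stores.length; ++j) stores[j][key] = found;
  return found;
};

delete cookieJar.uid;                                 // the user clears cookies
respawn([cookieJar, flashStore, localStore], 'uid');  // runs on the next page load
// cookieJar.uid has been restored from the other stores
```

The identifier only goes away if every store is emptied before the next respawn.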

Fortunately, it’s really easy for Linux users to defeat any kind of long-term tracking. Just use strace to look for files opened in write-mode:

$ strace browser 2>&1 | grep '^open(.*WR'

Then use Git to version-control these files. (This defeated the Evercookie when I tried it.) Provided that you start with a clean browser installation, you won’t be carrying around any zombie cookies. (It also has the advantage that you can use different Git branches to have different browser profiles.)

Xenophobic Arrays in IE7

My project was progressing quickly, cross-browser testing having succeeded in Chrome, Firefox, Safari, and IE8. Then I ran it on IE7 and found this mysterious error:

“constructor is null or not an object.”

As usual with IE, the line number and column were meaningless, so I looked for variables called “constructor.” Curiously, there weren’t any. Perhaps it was abbreviating an expression like “x.constructor.foo”, and saying that “x.constructor” is undefined. So I looked for such expressions, and found many occurrences of expressions like this:

return new x.constructor(…);

Suspicious that IE was pointing me in the wrong direction, I created a test case to reproduce the error. Sure enough, the message is different. IE gives the significantly more laconic “object expected” error if you call an undefined function.

This left only one possibility, that x was null or undefined and the error message was wrong. Sure enough, this test produces the misleading message “constructor is null or not an object”:

var x = null;
alert(x.constructor);

With that bit of confusion out of the way, I started searching through the code to find places where I could possibly be asking for the constructor of a null or undefined value. I figured it had to be a fairly weird case, since the program worked just fine on other browsers and the failing module had no DOM references.

After commenting out various constructor use sites, I finally found the problem. The expression looked like this:

if (xs[xs.length - 1].constructor === Function) {...}

I commented out the line and ran my unit tests again. This time there were no errors, but a lot of failures. Every array I made an assertion about appeared to be empty. For example:

var xs = new caterwaul.seq.finite([1, 2, 3, 4, 5]);
assert_equal(xs.length, 5);  // Fails: xs.length is 0

(This happens only on IE7, remember. Other browsers work just fine.)

This test isolated the problem perfectly:

var f = function () {};
f.prototype = [];
f.prototype.constructor = f;
var x = new f();
x.length = 10;
alert(x.length);  // alerts 0 on IE7

The reason for this weirdness is that IE7 uses a getter/setter property for “length” (which violates the ECMA standards); other browsers treat length normally. Getters and setters create a big problem for prototypal inheritance patterns like the one I was trying to establish: you can’t override an inherited property that has a setter, because assigning to it runs the setter instead of creating a new property on the inheriting object.
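
I can’t run IE7’s internals on another engine, but the effect can be imitated. This is a sketch that assumes an ES5-capable engine (for Object.defineProperty) and uses a made-up accessor on a plain prototype object; it is not how IE7 actually implements length, just the same observable behavior.

```javascript
// Imitation only: an accessor property named "length" on the prototype,
// whose setter silently discards whatever is assigned.
var proto = {};
Object.defineProperty(proto, 'length', {
  get: function () { return 0; },
  set: function (value) { /* ignore the assignment */ },
  configurable: true
});

var F = function () {};
F.prototype = proto;

var x = new F();
x.length = 10;  // runs the inherited setter; no own property is created

var observedLength = x.length;                    // still comes from the getter
var hasOwnLength   = x.hasOwnProperty('length');  // nothing was shadowed
```

The assignment never reaches x itself, which matches what the alert above shows on IE7.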

It gets worse, too. I thought of a hack to work around the problem. If IE7 is the only failure case, then some special logic to use the native push() method to update the length should do the trick. So something like this:

var expected_length = 10;
while (xs.length < expected_length) xs.push(xs[xs.length - 1]);

In any reasonable system this would update the length, taking O(n) time but otherwise not causing problems. However, IE7’s implementation of push() detects that xs is an object, not a native array, and silently does nothing. So in fact the above code loops forever if xs inherits from, but isn’t, a real array.

As a last attempt, I tried externally invoking Array.prototype.push on xs and xs[xs.length - 1]. This had the same effect as calling push() directly.
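
For comparison, this is what the external invocation does on an engine that follows the standard, where push is generic and works on any object with a length. It’s a sketch with throwaway names (F, xs), not the code from my project.

```javascript
var F = function () {};
F.prototype = [];            // inherit from a real array, as in the test above
F.prototype.constructor = F;

var xs = new F();
Array.prototype.push.call(xs, 'a');  // generic push on a non-array object

var lengthAfterPush = xs.length;     // push wrote a length onto xs itself
var isRealArray     = Array.isArray ? Array.isArray(xs) : false;
```

On a conforming engine xs ends up with its own length of 1 and an element at index 0, even though it is not a native array.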

The solution I’m going with is to create a size() method and have the logic fall back to .length if that’s unavailable. If anyone knows of a better solution (preserving Array in the inheritance chain), I’d love to hear it.
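
Roughly, the fallback looks like this. The helper name sizeOf and the Seq constructor are invented for the example; the real sequence classes track their size differently.

```javascript
// Prefer an explicit size() method; fall back to .length for native arrays
// and anything else that doesn't define one.
var sizeOf = function (xs) {
  return typeof xs.size === 'function' ? xs.size() : xs.length;
};

// Hypothetical array-derived class that keeps its own count in _size
// instead of relying on assignments to length.
var Seq = function (elements) {
  this._size = 0;
  for (var i = 0; i < elements.length; ++i) this[this._size++] = elements[i];
};
Seq.prototype = [];
Seq.prototype.constructor = Seq;
Seq.prototype.size = function () { return this._size; };

var seqSize   = sizeOf(new Seq([1, 2, 3, 4, 5]));
var arraySize = sizeOf([1, 2, 3]);
```

Callers use sizeOf everywhere, so nothing depends on whether length can be overridden.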

Inside Self-Modifying Perl

This weekend I’ve finally taken some time to document the thought process and development of self-modifying Perl scripts. I’ve got a guide that goes through writing one step by step, explaining each feature and how to implement it. At the end of the guide you have the object script from http://github.com/spencertipping/perl-objects.

It’s still a work in progress, but I’d love to hear any feedback: http://github.com/spencertipping/writing-self-modifying-perl.

Fluent with expressions

Suppose you were asked to implement the map() function in your favorite language. What would it look like? If the language doesn’t have list comprehensions, it would probably read something like this:

var map = function (f, xs) {
  var ys = [];
  for (var i = 0, l = xs.length; i < l; ++i)
    ys.push(f(xs[i]));
  return ys;
};

Two of these statements don't do anything useful. One is var ys = [], and the other is return ys. The only reason ys is even defined as a variable is that we have to side-effect over it. This problem is even more pronounced if you want a single-element hash mapping an arbitrary string to an arbitrary value:

var hash = function (k, v) {
  var result = {};
  result[k] = v;
  return result;
};

After giving this problem some thought, I found a nice way of expressing side-effects in Caterwaul. Basically, the boilerplate involved with executing side-effects comes from the overhead of creating multiple references to the same expression. In Javascript it takes a var something = e statement, and the ergonomics are terrible.

Caterwaul solves this by introducing /se[] and /re[] as postfix modifiers. For example:

var hash = function (k, v) {
  return {} /se[_[k] = v];
};

This definition is the same as before, but much more concise. The behavior of /se[] is to take the value on the left, bind it to _ inside the brackets, evaluate the bracketed code, and then return the value on the left (which presumably has been modified). /re[] is similar: take the value on the left, bind it to _ inside the brackets, and evaluate and return the bracketed expression.
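
If you don’t have Caterwaul handy, the semantics can be approximated with two ordinary functions. This is only a sketch of the behavior described above, not how the macros are implemented; se and re here are plain helpers I’m defining for the example, and the bracketed code becomes a callback.

```javascript
// se: run a side effect against the value, then return the value itself.
var se = function (value, f) { f(value); return value; };

// re: return whatever the bracketed expression evaluates to.
var re = function (value, f) { return f(value); };

// x /se[_[k] = v] becomes se(x, function (_) {_[k] = v})
var hash = function (k, v) {
  return se({}, function (_) { _[k] = v; });
};

// The map() from the start of this post, without "var ys" or "return ys".
var map = function (f, xs) {
  return se([], function (ys) {
    for (var i = 0, l = xs.length; i < l; ++i) ys.push(f(xs[i]));
  });
};

var doubled = map(function (x) { return x * 2; }, [1, 2, 3]);
var count   = re(doubled, function (_) { return _.length; });
```

The function versions are noisier than the macros, which is most of the reason the macros exist.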

After I wrote these macros, coding became a lot more fun. I later realized why: it's much easier to keep typing than it is to go back and rewrite a piece of code. This happens anytime you've got a value from some expression, e.g. $('#foo').children(), and then you realize you need to refer to it multiple times. It requires a lot of keystrokes to refactor that value into a variable, but fewer to use /se[] or /re[]. It doesn't seem like such a small change could improve the language's usability so much, but because of how frequently this situation arises it's probably the single most useful thing about Caterwaul.

Databases and Noobs don’t mix

I’m not very good with databases. Yesterday I nuked some production data due to a sloppy command (luckily there was a backup from a few minutes earlier, so we didn’t lose much). But it still bothered me that, even though I knew I was working with important data, I was able to make that mistake within about an hour of logging in.

Some background: the task in question was a data conversion job against a Mongo database containing ad definitions. There were two things that needed to happen. First, we had to pull some properties out of a text field and put them directly onto the ad objects. Second, we had to propagate statistics from the ads onto their containing groups (basically, ads belong to other statistical entities for tracking purposes).

The mistake I made was that I zeroed out the ad statistics in addition to the stats stored for the parent entities. I had mentally separated the ads from the other entities, despite the fact that they were all stored in the same collection, and had run a query that looked like this (to prepare the database for aggregation):

db.statistics.update({}, {value: 0});

Obviously a dumb mistake, but it seemed reasonable when I typed it. After some thought, I came up with several reasons why this happened.

  1. I was in a hurry. I should have become more familiar with the operating environment ahead of time, but I perceived that time was important and didn’t bother. (Not doing this ended up taking longer than doing my homework ahead of time, of course.)
  2. I was unfamiliar with the Mongo shell. (In particular, I didn’t realize that you could load external Javascript files.) This was a problem because I needed to write about 50 lines of code to do the conversion, and I ended up trying to write them in the REPL rather than starting a file where I could see all of my code at once. I later found the function to load a file by typing help admin during the restore process and ended up rewriting my conversion script in a file for the final run.
  3. Mongo (and every other database I’ve worked with) has terrible ergonomics. On the filesystem you can type ls to see what it is you’re about to delete, and you get a nice n-column listing of filenames. In Mongo the corresponding command is something like db.collection.find({criteria}).map(function (x) {return x.id}), and it gives you a single-column listing of non-unique entries. Because it takes so long to type it out (and you get a not very useful listing), you’re less likely to look at what you’re about to nuke/modify before you do it.
  4. There is neither (1) an undo command, nor (2) a way to try out a command before running it. (Most SQL databases have the latter with transactions, but Mongo doesn’t as far as I know.)
  5. There also isn’t an easy way to quickly clone your database (e.g. cp -a dir1 dir2), or snapshot it (e.g. tar -cz dir > backup.tar.gz or git commit -a) to have a quick rollback option in case you make a mistake. This also means that you can’t do a trial run against a copy of your data, just to make sure it works.
  6. Every delete/update against a database operates like rm -rf; that is, the table is flat, so you’re potentially selecting against every row. rm has checks to make sure you don’t inadvertently delete directories (and for this reason the -r flag feels dangerous), and you can use file permissions to cause rm to issue warnings; databases probably have similar safety features, but they aren’t enabled by default.
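
Here is the kind of guard I have in mind for points 3 and 4, sketched against a plain in-memory array rather than a real database. Everything in it is hypothetical: previewThenUpdate is not a Mongo function, and the kind field is invented so the example has something to select on.

```javascript
// Report how many documents a change would touch; only apply it when the
// caller passes confirmed = true after looking at that number.
var previewThenUpdate = function (docs, matches, change, confirmed) {
  var affected = [];
  for (var i = 0; i < docs.length; ++i)
    if (matches(docs[i])) affected.push(docs[i]);
  if (confirmed)
    for (var j = 0; j < affected.length; ++j) change(affected[j]);
  return {wouldAffect: affected.length, applied: !!confirmed};
};

// Made-up stand-in for a collection holding both ads and groups.
var statistics = [{kind: 'ad', value: 12}, {kind: 'ad', value: 7}, {kind: 'group', value: 30}];
var everything = function (doc) { return true; };
var groupsOnly = function (doc) { return doc.kind === 'group'; };
var zero       = function (doc) { doc.value = 0; };

var dryRun = previewThenUpdate(statistics, everything, zero, false);  // touches nothing
var real   = previewThenUpdate(statistics, groupsOnly, zero, true);   // zeroes one group
```

A dry run that reports three affected documents when you expected one is exactly the warning I didn’t get.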

Now the problem here obviously isn’t that databases are somehow broken. The Unix filesystem is certainly similarly treacherous for newbies, after all. But filesystems have been around for more than 40 years and for the most part the user interface hasn’t changed much. Mongo has been around for maybe five years? It’s going to be unfamiliar to a lot of people. Not having an “I’m a newbie, so ask me before I do anything I might not have intended” option in the database is as risky as having every user log in as root, disabling cp and tar, and aliasing rm to rm -rf. It’s not a problem if your users know exactly what they’re doing, but for newbies like me it’s a recipe for disaster.

let* and where* in Javascript (woohoo!)

Caterwaul has supported the let[x = y] in z syntax for some time, but recently I noticed that it could easily be extended to support let* verbatim, e.g. let*[x = y, a = x] in z. Version 0.2.3 adds support for these let-forms, which behave exactly as they would in Lisp or Scheme.
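
The difference between the two forms is easiest to see as nested functions. This is the usual Lisp-style desugaring written out by hand in plain Javascript; I’m not claiming it is the exact code Caterwaul generates.

```javascript
// let[x = 1, y = 2] in x + y
// Bindings are made in parallel, so no binding can refer to another.
var parallel = (function (x, y) { return x + y; })(1, 2);

// let*[x = 2, y = x * 10] in x + y
// Bindings are made in sequence, so y's initializer can see x.
var sequential = (function (x) {
  return (function (y) { return x + y; })(x * 10);
})(2);
```

Each extra binding in a let* adds one more level of nesting, which is why writing it by hand gets old quickly.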

Also new in 0.2.3 is custom lvalue support. This enables a nice syntax for function definition: let[f(x) = x + 1] in …. Examples are up on the Caterwaul compiler page, and in the new documentation directory, which I’m building to illustrate how to use Caterwaul in real-world applications.

At first I didn’t think custom lvalues would be possible, since Javascript encodes lvalues into the grammar (so putting an invalid expression on the left-hand side of an assignment gives you a syntax error, not a runtime one). However, there’s a great old “feature” from IE that requires an exception to be made for function calls. Other browsers had to adopt it too, so now we have a semantically void construct that is syntactically valid. Under any other imaginable circumstances I would be complaining about IE right now, but in this case I’m actually quite grateful.
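
You can check the exception yourself. The sketch below compiles two function bodies with the Function constructor and records whether each one gets past the parser; the helper name parses is mine. The results noted in the comments are what I expect from engines that keep the legacy behavior, so treat them as an assumption about your engine rather than a guarantee.

```javascript
// true if the engine accepts the body at compile time, false if it rejects it
var parses = function (body) {
  try { new Function(body); return true; }
  catch (e) { return false; }
};

var callAsLvalue    = parses('f(1) = 2');  // accepted: a call is a valid target syntactically
var literalAsLvalue = parses('1 = 2');     // rejected before any code runs

// The accepted form still has no meaning, so actually running it fails.
var failsAtRuntime = false;
try { new Function('f', 'f(1) = 2')(function () {}); }
catch (e) { failsAtRuntime = true; }
```

That gap between "parses" and "means something" is the room a macro needs to rewrite the call form into a function definition.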

Caterwaul improvements

I released version 0.2 of Caterwaul a couple of days ago. Not much has changed in the external interface (though I do think the fb[][] and fb_[] macros are really handy — those preserve ‘this’ inside a closure). Most of the improvement is in the way macros and hard references are handled.

If you’ve worked with Caterwaul much (and are aware of how it works), you may have been surprised at the level of integration between the syntactic macro forms and the resulting code. In particular, values produced by qs[] end up being actual references to syntax trees in a compiled function, so you could write (function () {return qs[foo]}) and it would return a piece of syntax representing ‘foo’.

At first this was achieved by making qs[] a special form, then having a syntax-rewrite that went through and replaced each qs[] with a gensym. The gensym would be a closure variable in an enclosing function, which would immediately be called on the value for that gensym (in practice, multiple gensyms and values would be used). This worked great for qs[] and macroexpansion, but the problem came in when you wanted to define a macro that took a value from compile-time and put it into code. The clunky {tree: x, environment: y} hackery that Caterwaul used previously was unhelpful, since it existed only to expand qs[] forms.

With the release of 0.2 that whole mechanism is different. Now there are two kinds of syntax nodes. One represents syntactic constructs as before (this hasn’t changed), and the other represents a value that you want to stick into your code. For example:

var g = caterwaul.clone('std')(function () {
  var f = fn[x][x + 1];
  return caterwaul.compile(
    qs[function (x) {return _(x)}].s('_', new caterwaul.ref(f)));
})();
g(5)    // -> 6

Even though the function was compiled in the global scope, it can still have access to a local value if we drop that value into the syntax tree.

The machinery to support this isn’t terribly complex either. It’s actually conceptually similar to the old expand_qs function, but each syntax node has its own bindings() method to generate a hash of symbols to their values. This gets passed in to the enclosing environment function and reduced to variables, and those variables are referenced by whatever ref node created them.
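
Stripped of the syntax trees, the idea looks something like this. It’s a sketch that works on source strings instead of Caterwaul’s nodes, and the names compileWithBindings and ref_1 are invented for the example; the real bindings() and ref machinery is more involved.

```javascript
// bindings maps generated names to the live values they stand for. The
// enclosing environment function takes those names as parameters, so the
// compiled code can refer to the values as ordinary variables.
var compileWithBindings = function (source, bindings) {
  var names = [], values = [];
  for (var name in bindings)
    if (Object.prototype.hasOwnProperty.call(bindings, name)) {
      names.push(name);
      values.push(bindings[name]);
    }
  var environment = Function.apply(null, names.concat(['return (' + source + ');']));
  return environment.apply(null, values);
};

var f = function (x) { return x + 1; };  // a local value, not visible globally
var g = compileWithBindings('function (x) {return ref_1(x)}', {ref_1: f});
var result = g(5);                       // calls f through the binding
```

The compiled function never sees f’s original scope; it only sees the parameter that the environment function bound f to.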