Breadth is sometimes better than depth

A while back the Scripting Guys blogged about using recursion to list all the files in a directory and all its subdirectories. Something that you'll notice about this very standard solution to this problem is that it is "depth first". That is, it lists the files like this:

c:\xyz.txt
c:\foo\bbb.txt
c:\foo\abc\def.txt
c:\foo\bar\baz.txt
c:\foo\bar\blah.txt
c:\qaz\zaq.txt
c:\qaz\lll\ggg.txt

It picks a branch and follows it down the tree as far as it can, listing directories as it goes, retreats back up as little as possible, picks another branch to go down, goes as far down it as possible, and so on, until the entire tree has been explored. You can see why this is called "depth first".
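
For reference, here's roughly what the recursive version looks like. (This is just a sketch in JScript against the FileSystemObject, not the Scripting Guys' actual code, but it shows where the depth-first behaviour comes from: the recursive call dives all the way into a subdirectory before the loop moves on to the next one.)

var FSO = new ActiveXObject("Scripting.FileSystemObject");

function listFiles(folder)
{
    // List every file in this folder, skipping any we can't read.
    for (var files = new Enumerator(folder.Files); !files.atEnd(); files.moveNext())
    {
        try { WScript.Echo(files.item().Path); } catch(e) {}
    }
    // Then recurse into each subfolder in turn; the call stack remembers where we were.
    for (var subs = new Enumerator(folder.SubFolders); !subs.atEnd(); subs.moveNext())
    {
        try { listFiles(subs.item()); } catch(e) {}
    }
}

listFiles(FSO.GetFolder("."));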

You don't need to use recursion to do this; recursion is just an elegant way to solve the problem. The essence of the algorithm is that information about the next node to process is stored in a stack. The algorithm is simple: pop the stack, process the directory, push all the subdirectories onto the stack, repeat until the stack is empty. Recursion just lets you do this algorithm without making the stack explicit; the stack is the call stack and doing the recursive call pushes a new frame on the call stack for you. We can easily do this without recursion by making the stack explicit:

var FSO = new ActiveXObject("Scripting.FileSystemObject");
var stack = new Array();
stack.push(FSO.GetFolder("."));
while (stack.length > 0)
{
    var folder = stack.pop();
    // Push every subfolder; the most recently discovered one is processed next.
    for (var enumtor = new Enumerator(folder.Subfolders); !enumtor.atEnd(); enumtor.moveNext())
    {
        try
        {
            stack.push(enumtor.item());
        }
        catch(e) {}
    }
    // List the files in this folder, skipping any we can't read.
    for (enumtor = new Enumerator(folder.Files); !enumtor.atEnd(); enumtor.moveNext())
    {
        try
        {
            WScript.Echo(enumtor.item().Path);
        }
        catch(e) {}
    }
}

Notice that I've wrapped the file manipulation code in try-catches. As I've mentioned before, in this script if there is a problem reading a file -- like, I don't own it and therefore I get a security violation -- I want to skip the problem and continue on. I do not want to write robust error handling for this trivial one-off script.

The other day I had a problem -- a few weeks earlier I had written a little test program and stuck it in a directory somewhere very near the top of my directory hierarchy, but I couldn't remember where exactly. I already had the depth-first program above, and could easily modify it to add a regular expression search of each file looking for a pattern. But there are some very deep directories on that drive where I knew there would be no matches, but which would take a lot of time to search. I could have written code to omit those directories, but there was another way to solve this problem: since I knew that the file was going to be somewhere shallow, don't do a depth-first search in the first place. Do a breadth-first search. That is, search all the files in the top level, then all the files that are one deep, then all the files that are two deep, then three deep, and so on.

The nice thing is that it's basically the same algorithm, just with a minor change. Instead of doing a last-in-first-out stack, we'll just do a first-in-first-out queue and magically get a breadth-first traversal:

var FSO = new ActiveXObject("Scripting.FileSystemObject");
var queue = new Array();
queue[queue.length] = FSO.GetFolder(".");
var counter = 0;
while (counter < queue.length)
{
    var folder = queue[counter];
    // Append every subfolder to the end of the queue; it will be visited after
    // everything already queued, which is what makes the traversal breadth-first.
    for (var enumtor = new Enumerator(folder.Subfolders); !enumtor.atEnd(); enumtor.moveNext())
    {
        try
        {
            queue[queue.length] = enumtor.item();
        }
        catch(e) {}
    }
    // List the files in this folder, skipping any we can't read.
    for (enumtor = new Enumerator(folder.Files); !enumtor.atEnd(); enumtor.moveNext())
    {
        try
        {
            WScript.Echo(enumtor.item().Path);
        }
        catch(e) {}
    }
    // Drop the reference to the processed folder and advance to the next one.
    queue[counter] = null;
    counter++;
}

This then puts out the files in the order

c:\xyz.txt
c:\qaz\zaq.txt
c:\foo\bbb.txt
c:\qaz\lll\ggg.txt
c:\foo\abc\def.txt
c:\foo\bar\baz.txt
c:\foo\bar\blah.txt

going gradually from shallowest to deepest rather than continually diving way down and then coming back up.
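
And the actual search for the misplaced test program is only a small tweak to the file loop above: open each file, read it, and test its contents against a regular expression. A rough sketch -- the pattern here is just a made-up placeholder, and it assumes the files are small enough and text-like enough to read in one go -- would be:

var pattern = /SomePlaceholderPattern/i;  // stand-in for whatever I was actually looking for

for (enumtor = new Enumerator(folder.Files); !enumtor.atEnd(); enumtor.moveNext())
{
    try
    {
        var file = enumtor.item();
        var stream = FSO.OpenTextFile(file.Path, 1);  // 1 = ForReading
        if (!stream.AtEndOfStream && pattern.test(stream.ReadAll()))
            WScript.Echo(file.Path);
        stream.Close();
    }
    catch(e) {}
}

Everything else -- the queue, the counter, the subfolder loop -- stays exactly the same; the breadth-first order just means that anything sitting near the top of the tree turns up long before the script wastes time in the deep directories.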

Comments

  • Anonymous
    September 27, 2004
    The queue idea is very slick. Perhaps now I can convince my wife that breadth is better than depth! ;)
  • Anonymous
    September 27, 2004
    Indeed, depth-first is a bad way to search the web. There are branches of the web which are for all practical purposes infinitely deep, but very few pages are infinitely broad.
  • Anonymous
    September 27, 2004
    Why not use .push() and .shift() instead of the manual array manipulation in the second example? Unless you're using IE 5.01 of course. :)
  • Anonymous
    September 27, 2004
    Shift is an array copy. If there are a thousand items in the array, that's a thousand property creations, a thousand property deletions and a thousand value copies every time it's called. Very expensive.
  • Anonymous
    September 27, 2004
    This is a trade-off - like with everything else in life: would you trade the simplicity in coding for the memory eaten by your stack? When 200G drives are becoming a normal thing, what would be the amount of files and folders you have to "push" and "pop"?

    My 20G logical drive for "all things work" has 20 000 folders, sometimes up to 20 deep - I like my folders organised...

    I do agree, however, that one should look at recursive functions in the morning somewhere between the second cup of coffee and lunch.
  • Anonymous
    September 27, 2004
    Recently I needed to extend our build process to perform a simple textual manipulation on some files: replace a placeholder with the version text. I implemented this step as a .wsf file using JScript and the BeyondJS library. The reason: BeyondJS made it very easy to perform a sequence of operations on all the files in a subdirectory that matched a particular pattern. For example, this code snippet performs the same operation as Eric's second sample (it's breadth-first):

    File.recurse(".").foreach(alert);

    "alert" in this case is a wrapper function for WScript.Echo. Yes, I've cut a few corners: it is recursive and I've neglected the try-catch. I still think it's cool.

    Here is the code to output only .htm or .html files:

    File.recurse(".").filter(/.html?$/i).foreach(alert);
  • Anonymous
    September 27, 2004
    JScript may not have tail recursion, but Cocoon's Rhino-based JavaScript implementation does. It even sports continuations:

    http://wiki.apache.org/cocoon/RhinoWithContinuations
  • Anonymous
    September 27, 2004
    That's pretty darn cool -- but of course in this case the point is moot because this isn't an algorithm that can be made tail recursive.

    To be tail recursive, the function can't do any work that needs the state of the local variables after the recursive call, but in this algorithm, the state of the enumerator is needed after the recursive call.
  • Anonymous
    September 30, 2004
    Eric, are you sure about how shift works? Here are my timing results for shifting/popping an array of 5000 numbers and 5000 relatively large objects. The times are almost identical regardless of the type of element (in milliseconds):

    operation, numbers, strings, objects
    pop, 3359, 3360, 3343
    shift, 12796, 12844, 12829

    If it was actually copying the whole (remaining) array on each iteration of shift I would have thought the difference would be much more than 3.8x slower, and that times would vary by element type. Here's the code:

    var a = [];
    var n = 5000;
    for (var i = 0; i < n; i++)
        a[i] = i - 12;
    var t1 = new Date();
    for (var i = 0; i < n; i++)
        t = a.shift();
    var t2 = new Date();
    alert(t2.getTime() - t1.getTime());