Interview answers
In the previous post, I’ve come up with this interview question:
In a given .NET string, assume there are line breaks in standard \r\n form (basically Environment.NewLine).
Write a method that inserts a space between two consecutive line breaks to separate any two line breaks from each other.
Update: Rik Hemsley has posted an absolutely fantastic summary in form of a Visual Studio solution with all the answers from the comments, unit-tests and benchmarks:
https://code.google.com/p/kirill-question/
Jordan Terrell’s RegExp-based solution seems to be the fastest one.
Thanks Rik! This is very helpful!
Source of the problem
First I’ll explain how I came up with the problem. I was writing some code that generated HTML, specifically, some text inside the <PRE> tag, interspersed with <BR /> tags. I’m not an expert in HTML, that’s why it came as a surprise to me that if you have multiple consecutive <BR /> elements, the browser will only render one line break (well, at least the browser I checked it on - IE8). My code generated the <BR /> tags from Environment.NewLine substrings in the source text. So, to preserve empty lines in the resulting HTML, I had to intersperse the line breaks with at least one space – then IE would keep the empty lines when rendering <PRE> contents. I didn’t keep direct line breaks because some HTML viewers that I care about don’t render them correctly.
My original solution
Anyway, I’ve gotten a lot of great responses that contain a lot of food for thought. Mostly I was amazed how the responses actually reflected my own thinking as I was coming up with a solution. My ultimate solution (which was also correctly suggested by ram) was this:
public static string IntersperseLineBreaks(string source)
{
string result = source.Replace("\n\r", "\n \r");
return result;
}
A temporary local variable here is for your debugging convenience until my team comes up with a way to inspect the return value of a method in the debugger.
Ambiguous problem statement
Of course, banko correctly noted that the problem is stated ambiguosly, since Environment.NewLine is “\r\n” on Windows, and “\n” on Unix. Besides, as Jordan Terrell mentions, the behavior is unclear on source texts that contain wild mixes like “\r\n\r” and “\r\n\n”. Jordan also suggests that I include unit-tests to specify correct solutions.
Well, I apologize for a problem statement that is not specified well enough (I simply didn’t have time to be that thorough!). I didn’t do it on purpose, but now when I think about it, most real-world requirements are ambiguous and not well specified. It is always interesting to see how a candidate deals with ambiguity and unclear requirements. So be prepared to expect unclear problem statements in your interviews and deal with them gracefully.
There are several good strategies to impress the interviewer in this case, the simplest being to ask clarifying questions in the places where you think the requirements are unclear. Let them know that you are trying to explicitly understand and model the requirements in your head and are filling in the gaps.
Another approach is to explicitly identify and parameterize the variability in the requirements and give a generic solution that works in all cases. Finally, if the problem statement is unclear, you can totally assert, assume or demand requirements that make your life easier. For instance, I remember I said “let’s assume that C# has tuples” during my interview at Microsoft. The interviewer laughed and said OK. I produced a nice solution that used tuples, and then he asked how would I change it to not use tuples. But that’s a totally different question ;)
Assessment criteria
If I were an interviewer (and I’ve never been one yet) and I would happen to ask this question, I guess I would look at several things in answers:
- Code reuse (e.g. my solution above relies on the Replace implementation in the framework vs. reimplementing a custom version of it for this particular problem, which might be error prone). However the point of interview questions is usually to see how people implement algorithms, so I guess for a sorting question an answer like List<T>.Sort() would rate very high on code reuse, but pretty much useless otherwise :)
- Correctness – some people gave similar solutions to mine, which call replace twice, but are more resilient to “weird input streams” with non-normalized line breaks. Such solutions will be twice as slow (since they will do full string replacement twice), but if a candidate understands the trade-offs between performance and correctness, I’m totally fine with it.
- Performance – writing your own StringBuilder-based implementation of Replace specifically suited for this particular problem is totally OK. In fact, some people gave very efficient solutions using this approach (e.g. Jared Parsons (a developer who I work with, who sits in the office down the hall) had built a custom ‘replacer’ state machine in F#) and Thomas Levesque correctly used a StringBuilder to avoid unnecessary string allocations. Some people fell into the trap of using string concatenation instead of a StringBuilder, which is by far less efficient, since it allocates ~ O(n^2) memory vs. O(n log n) memory allocated by StringBuilder. I would want a .NET candidate to understand how strings work in .NET, what are allocations, how GC works, and why string concatenation is bad when used in a loop.
- Style – having a concise declarative version is great when performance is not critical. Having readable code that is expressive and easy to maintain can sometimes trump performance. Moreover, usually people who are able to express their thoughts in code concisely will likely also master the skill of performance tuning when necessary.
So I guess it’s impossible to declare a single winner because different solutions win when different assessment criteria are used. In terms of readability, getting things done and reusing as much as possible, I like ram's solution (and mine :P), even if it won’t work in certain scenarios (I’m OK with making assumptions about the source text). In terms of algorithm implementation, other solutions are certainly much better, those that use a StringBuilder or build the resulting array char-by-char. Finally, in terms of expressiveness I really liked some declarative solutions (using LINQ), even if they’re less effective (e.g. use string concatenation).
string.Replace(“\r\n\r\n”, “\r\n \r\n”)
One solution deserves some special explanation since it nicely demonstrates the core of the problem. The Replace method continues matching substrings after the end of the replacement, so if your replacement patterns overlap, it won’t replace the second one:
To overcome this problem, some people suggested the following:
while (str.Contains("\r\n\r\n"))
str = str.Replace("\r\n\r\n", "\r\n \r\n");
Note that this loop will never run more than twice (this is where I goofed myself). GSEJ corrected me, pointing out that only two passes are necessary – first pass leaves a maximum of two consecutive line breaks, second pass eliminates them all.
However we notice that in the replacement pattern (“\r\n\r\n”, “\r\n \r\n”), the start and the end are the same (hence the first and last chars overlap with previous and next occurrences of the pattern). The pattern is redundant, so we can make it shorter and only replace the crux of the pattern: Replace(“\n\r”, “\n \r”):
This pattern is non-redundant and doesn’t overlap with the neighbors, hence no second pass required.
Well, anyway, thanks everyone for your answers! It was truly interesting for me and hopefully educational and entertaining for some of you.
Comments
Anonymous
March 30, 2010
you can speed up Jordan's by precompiling the RegEx and caching it in a static: static readonly Regex _re = new Regex (@"(rn)(?=rn)", RegexOptions.Compiled); public override string InsertSpaceBetweenCrLfs(string input) { return _re.Replace (input, "$1 "); } however, this is nearly twice as fast (on my box): public override string InsertSpaceBetweenCrLfs (string input) { // check for null if (input == null) throw new ArgumentNullException ("input"); int cch = input.Length; // simple boundary condition, helps later if (cch < 4) return input; // pre-allocate space var sb = new StringBuilder (cch * 3 / 2 + 1); // handle boundary condtion later, instead of in loop cch -= 3; int ichStart = 0; // the start of the current span int ichEnd = 0; // the end of the current span while (ichEnd < cch) { int ch = input [ichEnd++]; // check for double newlines, won't overflow if (ch == 'r' && input [ichEnd] == 'n' && input [ichEnd + 1] == 'r' && input [ichEnd + 2] == 'n') { // copy the previous span sb.Append (input, ichStart, ichEnd - ichStart); // add a space sb.Append (' '); // start the new span ichStart = ++ichEnd; } } // copy the remaining characters sb.Append (input, ichStart, cch - ichStart + 3); return sb.ToString (); }Anonymous
March 30, 2010
Kirill, apologies for spelling your name incorrectly initially. I hate it when people do that to me! I've put the code here: http://code.google.com/p/kirill-question/ PiersH, your solution isn't as fast as the the regular expression version, on my test data, which is simply 2MB of Lorem Ipsum. I chose this data as it seemed likely it would be similar to real world scenarios. I've also added the caching to Jordan's version and, as expected, it speeds up somewhat. In normal code, I generally don't bother caching regular expressions, as they tend not to be (de-)allocated in a tight loop, but my artificial benchmark does this, so we get to see the difference caching can make.Anonymous
March 31, 2010
@Rik, i noticed a little copy/paste bug here: AddBenchmark("PiersH2", () => new PiersH().InsertSpaceBetweenCrLfs(text), benchmark); once fixed, i get these results: Name Milliseconds Ddkk 17.51445000 JaredParsons 1828.32110000 JordanTerrell 19.67060196 JordanTerrellCached 17.04021525 KooKiz2 20.55781633 PiersH 49.96797619 PiersH2 9.71894175 Robert2 20.53869184Anonymous
March 31, 2010
Rik: no problem at all! I really appreciate you summarizing the solutions and giving us a way to benchmark and compare them. Some really interesting results there! PiersH2: oh wow, it's awesome that you were able to bring it down to 9.7Anonymous
April 01, 2010
PiersH: I fixed my code to call your updated solution, but I don't see it running that fast. Do you have some updated code? It also fails the unit tests. Example: "rnrnrn" Expected: "rn rn rn" But was: "r r rn" Name Milliseconds Ddkk 14.25188310 fholm 59.60957647 JaredParsons 1222.55470000 JordanTerrell 13.11744545 JordanTerrellCached 10.87201828 KooKiz2 16.34176129 PiersH 49.39638571 PiersH2 20.95233333 Robert2 16.28622097 Warturtle 47.26178636Anonymous
April 02, 2010
Cool. By the way, what tool do you use for making the image? It looks great.Anonymous
April 03, 2010
Thanks Benjamin :) I use PowerPoint (2010) SmartArtAnonymous
April 08, 2010
I get similar timings to Piers. (In Release build (x64))