Jaa


Zip Me Up

Suppose you’ve got a sequence of Foos and you want to project from that a sequences of Bars. That’s straightforward using LINQ:

IEnumerable<Bars> bars = from foo in foos select MakeBar(foo);

or, without the query sugar:

IEnumerable<Bars> bars = foos.Select(foo=>MakeBar(foo));

But what if you have two sequences that you want to project from? Say you’ve got two sequences of doubles that are the same length, and you want to project a sequences of points.

The operation of “project from two sequences of the same length” is called a “zip join”, because its like doing up a zipper – you need each member of the left sequence to correspond to a member of the right sequence. In the version of the base class library that will ship with C# 4.0, we’ll be adding a Zip sequence operator to the standard extension methods.

The code to do so is pretty trivial; if you happen to need this in C# 3.0, I’ve put the source code below.

We considered adding a query syntax for zip joins. Something like

from foo in foos, bar in bars select MakeBlah(foo, bar)

or

from (foo, bar) in (foos, bars) select MakeBlah(foo, bar)

or some such thing, which would be syntactically transformed into

foos.Zip(bars, (foo, bar)=>MakeBlah(foo, bar))

However, we decided that adding a query syntax for a relatively obscure operation that is not actually supported by most relational databases was not worth the cost of complicating the grammar and the syntactic translation rules.

Here’s the source code:

public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>
(this IEnumerable<TFirst> first,
IEnumerable<TSecond> second,
Func<TFirst, TSecond, TResult> resultSelector)
{
if (first == null) throw new ArgumentNullException("first");
if (second == null) throw new ArgumentNullException("second");
if (resultSelector == null) throw new ArgumentNullException("resultSelector");
return ZipIterator(first, second, resultSelector);
}

private static IEnumerable<TResult> ZipIterator<TFirst, TSecond, TResult>
(IEnumerable<TFirst> first,
IEnumerable<TSecond> second,
Func<TFirst, TSecond, TResult> resultSelector)
{
using (IEnumerator<TFirst> e1 = first.GetEnumerator())
using (IEnumerator<TSecond> e2 = second.GetEnumerator())
while (e1.MoveNext() && e2.MoveNext())
yield return resultSelector(e1.Current, e2.Current);
}

Comments

  • Anonymous
    May 07, 2009
    Sounds similar to the  List.zip and List.unzip functions in F#. Interesting why F# decided to have this on List instead of any sequence i.e IEnumerable<T>

  • Anonymous
    May 07, 2009
    The comment has been removed

  • Anonymous
    May 07, 2009
    The comment has been removed

  • Anonymous
    May 07, 2009
    Well, nothing stops you from calling it explicitly as Enumerable.Zip(), right? I often do that with Concat() for precisely the same reasons. (by the way, I wish Enumerable.Concat() was made vararg...)

  • Anonymous
    May 07, 2009
    Nice... A good part of C#3 IEnumerable<> extensions resemble what exists in Haskell, so I always felt like Zip() was missing. But Unzip() can probably not be added, unless some standard tuples are added too. I feel like I could have used the general syntax for zip joins to replace the majority of my 'for' loops (as opposed to 'foreach' loops). The main use of indices is for looking up in other collections in a synchronized way. Sure, the database querying analogy doesn't extend to zip joins, but I mostly use linq for functional iterations anyway. :)

  • Anonymous
    May 07, 2009
    I like how zip join is defined in Haskell: zipWith          :: (a->b->c) -> [a]->[b]->[c] zipWith z (a:as) (b:bs)            =  z a b : zipWith z as bs zipWith _ _ _                          =  [] 20 lines of code converted into 3! Granted, the Haskell version doesn't check for nulls, but that's because it doesn't have null =).

  • Anonymous
    May 07, 2009
    Nice...  Though in my opinion a drawback of the given implementation is that if the enumerables are of different lengths, it silently drops the extra elements.  Would probably be better to have a test that if either enumerable hits its end, the other one hit its end at the same time.  (and of course you want to behave properly if the inputs have infinite length) Doing so is problematic for common cases. For example, suppose you want to zip a finite sequence against an infinite sequence of integers to form a finite indexed sequence. With your proposal you'd have to iterate the finite sequence twice, once to count it, and once to zip it with a finite index sequence. By allowing slop on either side we allow one of the sequences to be infinite. -- Eric    Something like      using(var e1 = first.GetEnumerator()) {
           using(var e2=second.GetEnumerator()) {
             while(true) {
               var hasMore1=e1.MoveNext();
               var hasMore2=e2.MoveNext();
               if(!hasMore1 || !hasMore2) { //one of them has ended
                 if(hasMore1 || hasMore2) { //one of them has not ended
                   throw new Exception("Enumerables have different lengths?");
                 }
                 break;
               }
               yield return resultSelector(e1.Current, e2.Current);
             }
           }
         }

  • Anonymous
    May 07, 2009
    The comment has been removed

  • Anonymous
    May 07, 2009
    I understand why you've decided against the query comprehension syntax. But it would be great if you could add it to System.Linq.Enumerable so we still have parity with Python.

  • Anonymous
    May 07, 2009
    I would also have checked lengths first. If you check lengths first, you have to iterate both sequences twice. That potentially doubles the cost to every call. -- Eric And would prefer static syntax, but that may just be the influence of F# Love you blog, though I only understand 70% of it (but that's better than a few years ago - so either you're dumbing down, or I'm smarting up)

  • Anonymous
    May 07, 2009
    The comment has been removed

  • Anonymous
    May 07, 2009
    @Kosak I see IEnumerable<T>'s of different lengths as a feature not a bug.. The fact that Eric didn't require the the zip to work on bounded sequences, where the upper bound (Length) is known aprori, is a feature not a bug. It will allow the methods use with unbounded generator based streams. I for one ask, please don't add any checking to make sure the two IEnumerable<T>'s are the same cardinality.

  • Anonymous
    May 07, 2009
    In a slightly more abstract view, Zip and Unzip are really the same function, and that function is Transpose. Zip turns a pair of lists (2xN) into a list of pairs (Nx2). Unzip just transposes the dimensions again.  In APL and derivatives (J, K, etc.) it actually works that way because lists, tuples, and function parameters are either the same thing or interchangeable.

  • Anonymous
    May 07, 2009
    I use two variations of a method like this. Zip is as above - it stops once either enumerable expires. ZipEqual throws an exception if one enumerable expires before the other. It doesn't do any sort of checking in advance, it just throws the exception during iteration when one expires and the other doesn't. I tend to use ZipEqual whenever possible because it makes it clear that the sequences are expected to be the same length. I think I got the name and the behaviour from ML. Still, if you're only going to provide one in the standard library I think it should be the more flexible Zip.

  • Anonymous
    May 07, 2009
    Thank you for submitting this cool story - Trackback from DotNetShoutout

  • Anonymous
    May 08, 2009
    Clearly the query syntax is pared down to some notion of what is most commonly used, but I'm not sure whether or not an operator is supported by the majority of relational databases is a good criteria to base that decision on.  The query syntax is part of Linq, not part of a specific ORM like LinqToSql or the EF.

  • Anonymous
    May 08, 2009
    Interesting Finds: May 8, 2009

  • Anonymous
    May 08, 2009
    I would go with an implementation that matches Ruby for objects: irb> [1,2,3].zip([4,5]) => [[1, 4], [2, 5], [3, nil]] irb> [1,2,3].zip([4,5,6,7]) => [[1, 4], [2, 5], [3, 6]] Either exception out or have nil as the second member of the tuple when the zipee is shorter. Keep the existing behavior for zippor is shorter than the zipee Also this seems a bit more natural and as @OlivierLeclant said in an earlier comment, it is unzippable,   public static IEnumerable<Tuple<TFirst, TSecond>> Zip<TFirst, TSecond> (this TFirst first, TSecond second);

  • Anonymous
    May 08, 2009
    Anthony, I agree, and the point is that the code I posted works perfectly well on unbounded streams, as should have been obvious from both the code and the parenthetical comment in my explanation. Nothing in my code requires that the enumerables are bounded.  The only additional requirement that my code imposes is that IN THE CASE THAT the code encounters the end of one stream, it will check that the other stream terrminates there also.

  • Anonymous
    May 08, 2009
    @Kosak  : I just don't think it is an Exception to have different lengths. Are you telling me you code doesn't throw on different lengths? """ throw new Exception("Enumerables have different lengths?"); """

  • Anonymous
    May 08, 2009
    Can you include how that would be done without using LINQ and possibly a real world scenario where this could be used?

  • Anonymous
    May 08, 2009
    Reviewing what I learned from your excellent post on naming things, I have to admit that Andrey Shchekin's comment is spot on.

  • Anonymous
    May 08, 2009
    Why did you choose a 2-method implementation? What is the need for the private ZipIterator()? Can't you yield return from Zip()? Good question. I covered that already a while back: http://blogs.msdn.com/ericlippert/archive/tags/Psychic+Debugging/default.aspx -- Eric  

  • Anonymous
    May 08, 2009
    Thanks Eric!! I use iterators all the time and was not aware that my null checks (just like in your psychic debugging example) where being deferred... doh!

  • Anonymous
    May 08, 2009
    "However, we decided that adding a query syntax for a relatively obscure operation that is not actually supported by most relational databases was not worth the cost of complicating the grammar and the syntactic translation rules." That's exactly the reason why I think Scala made the right decision to put Linq-similar things in libraries and instead create an expandable language. While Linq surely feels nice it can't be extended by anyone except the C# developers...

  • Anonymous
    May 09, 2009
    The comment has been removed

  • Anonymous
    May 10, 2009
    Eric, will there be an Unfold in version 4? Not to my knowledge. But it would be trivial to write such a beast yourself. How about this? static IE<T> Unfold<S, T>(this S state, Func<S, bool> done, Func<S, S> next, Func<S, T> transform)
    {
      while (true) 
      {
        yield return transform(state);
        if (done(state)) yield break;
        state = next(state);
      }
    } -- Eric

  • Anonymous
    May 10, 2009
    Daniel Lehmann wrote: > That's exactly the reason why I think Scala made the right decision to put Linq-similar things in libraries and instead create an expandable language. While Linq surely feels nice it can't be extended by anyone except the C# developers... Linq is a pattern, anyone can adhere to the Linq pattern, which means that the implementation part of Linq is extensible at the core, the only part of Linq that is not extensible is the comprehension syntax (you can't add new comprehension keywords, but you can make the comprehension syntax translate to entirely different "member" calls, usually methods). Other than adding new language keywords, you can extend Linq in any possible way imaginable (you can completely replace Linq to SQL, or Linq to XML, or Linq to Objects with your implementation, or simply mix other implementations with your own selectively)... The comprehension syntax in linq is simply a loose pattern matching (that works by translating to appropriate "member" calls (not just methods), using regular member resolution rules), the rest of Linq is straight forward libraries with standard CLR types that adhere to the linq pattern, and which can be implemented by anyone.

  • Anonymous
    May 14, 2009
    Re: Unfold, thanks! I already wrote my own in a simpler form where the transform part is left out (just stick a Select after it if you need that). I guess everyone using Linq is writing their own slightly different version of it.

  • Anonymous
    May 19, 2009
    The comment has been removed

  • Anonymous
    March 31, 2010
    I don't claim to know how iterators work in C#, but the way another language implementation throws an iteration exception when you run off the end of the list allows you to simplify the zip operation something like this: function zip(left, right) {  for (;;)    yield [left.next(), right.next()]; }