Baby Names, Nameless Keys, and Mumbling
Baby Names
I recently finished reading Freakonomics. It is a fascinating book about a number of strange phenomena. Its topics range from the economics of dealing crack to cheating in sumo wrestling. Among the sundry topics is a discussion of the psychology and sociology underlying baby names.
This topic has interested me ever since I found out that my wife and I gave our first two children some of the most popular names for their birth years. We did not intentionally look for popular names, but we picked them anyway. I always wondered how it was that I picked with the crowd. I had theories but I didn't have any data.
So when I heard Levitt and Dubner's theory about why some baby names are so popular, I was naturally very curious. Their hypothesis is that affluent and successful families set the trend (not celebrities). Some baby names begin to take hold in affluent circles. Then, when less successful parents are looking for a baby name, they choose the names of children of privilege and opportunity. Thus the name continues to extend its influence from one family to another. Eventually, everyone is giving their child the name (often misspelling it). Finally, as more and more of the common folk use the name, the elitists stop using it.
Their theory seemed probable enough and they had some good data to back it up, but I had my doubts. I certainly didn't feel like I picked my son's name because I thought some other child's opportunities would rub off on him.
Later on, I was in need of an app to test Linq to Objects query performance over a large dataset. Unfortunately, we didn't have a suitable app and I didn't have a lot of time. Furthermore, I didn't have a large easily accessible in-memory dataset to work with. So I decided to write a quick little app to screen scrape the Social Security Administration's Popular Baby Names site. The app pulled down the top one hundred most popular names for every year in every state by gender since the year 1960. I ended up with 40 megabytes of XML where each element looked something like this:
<PopularName state="Alaska" year="1960" rank="5" gender="Female" name="Susan" count="50" />
I then wrote an app that loaded all of the data into memory. Each XML element became a PopularName object which has a property for each attribute in the XML. These names were stored in a local variable called names of type List<PopularName>.
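As a rough sketch of that loading step (not the code from the original app), the elements could be read with LINQ to XML. The PopularName property names follow the attributes shown above; the PopularNameLoader class, the Parse method, and the PopularNames root element are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Assumed shape of PopularName: one property per XML attribute.
public class PopularName
{
    public string State { get; set; }
    public int Year { get; set; }
    public int Rank { get; set; }
    public string Gender { get; set; }
    public string Name { get; set; }
    public int Count { get; set; }
}

// Hypothetical helper; the original post does not show its loader.
public static class PopularNameLoader
{
    // Turns every <PopularName> element in the XML text into an object.
    public static List<PopularName> Parse(string xml)
    {
        return XDocument.Parse(xml)
            .Descendants("PopularName")
            .Select(e => new PopularName
            {
                State = (string)e.Attribute("state"),
                Year = (int)e.Attribute("year"),
                Rank = (int)e.Attribute("rank"),
                Gender = (string)e.Attribute("gender"),
                Name = (string)e.Attribute("name"),
                Count = (int)e.Attribute("count")
            })
            .ToList();
    }

    public static void Main()
    {
        // The root element name is an assumption.
        string xml = @"<PopularNames>
  <PopularName state=""Alaska"" year=""1960"" rank=""5"" gender=""Female"" name=""Susan"" count=""50"" />
</PopularNames>";
        List<PopularName> names = Parse(xml);
        Console.WriteLine("{0} {1}", names[0].Name, names[0].Count); // Susan 50
    }
}
```

For a file on disk, XDocument.Load(path) could stand in for XDocument.Parse.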
I then wrote a number of queries against the baby name dataset. One of them shows how many children were given a particular name in each year. It is run by calling NameUsage, passing in names and the name to search for.
NameUsage(names, "Wesley");
The body of the method looks like this:
static void NameUsage(IEnumerable<PopularName> names, string searchName)
{
    Console.WriteLine("{0} Usage", searchName);
    var q = from name in names
            where name.Name == searchName
            group new { name.Name, name.Count } by name.Year
            into g
            orderby g.Key ascending
            select new { Year = g.Key, TotalCount = g.Sum(x => x.Count) };
    foreach (var item in q)
        Console.WriteLine(item);
}
This particular query displays:
Wesley Usage
{ Year = 1960, TotalCount = 107 }
...
{ Year = 2005, TotalCount = 159 }
Here is a graph of the data for the usage of the name "Wesley".
So it seems that my parents were victims of their time as well. But is it only me and my children?
Apparently not. My wife was also given her name during its period of popularity. Note that these names are not necessarily really popular names. Even so, the giving of various names seems to ebb and flow. It is fascinating to think about how this behavior emerges from the vast array of parents seeking the best name for their child.
Nameless Keys
Another one of the queries that I wrote listed the most popular names overall. I wanted to distinguish names by gender usage (Terry, Pam, ...), but how can we do that with queries?
What I really want is to make the equality of the names based on the name itself and the gender usage. In C# 3.0, we added anonymous types. These neat little guys are very useful and one of the ways that they are the most useful is as composite keys.
Anonymous types have structural equality semantics within an assembly. If two anonymous types are created with the same property names, in the same order, and with the same property types, then they are the same type. If the property values are also equal, then the two instances are equal and have the same hash code.
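A small sketch makes these rules concrete. The class name and the sample values are made up for illustration. Note that equality here means Equals and GetHashCode; the == operator on two anonymous-type instances still compares references.

```csharp
using System;

public static class AnonymousEqualityDemo
{
    public static void Main()
    {
        var a = new { Name = "Terry", Gender = "Male" };
        var b = new { Name = "Terry", Gender = "Male" };
        var c = new { Name = "Terry", Gender = "Female" };
        // Same property names and types, but in a different order.
        var d = new { Gender = "Male", Name = "Terry" };

        Console.WriteLine(a.GetType() == b.GetType());          // True
        Console.WriteLine(a.Equals(b));                         // True
        Console.WriteLine(a.GetHashCode() == b.GetHashCode());  // True
        Console.WriteLine(a.Equals(c));                         // False
        Console.WriteLine(a.GetType() == d.GetType());          // False
        Console.WriteLine(ReferenceEquals(a, b));               // False
    }
}
```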
We can use these facts to write queries which define equality between objects on several values. In this case equality depends on the name and the gender. So in the group...by clause we will group not on the name but on an anonymous type with members corresponding to the name and to the gender of the item.
static void TopNames(List<PopularName> names, int count)
{
    Console.WriteLine("Top {0} Names", count);
    var q = (from name in names
             group name.Count by new { name.Name, name.Gender }
             into g
             let TotalCount = g.Sum()
             orderby TotalCount descending
             select new { g.Key.Name, g.Key.Gender, TotalCount })
            .Take(count)
            .Select((x, rank) => new { Rank = rank + 1, x.Name, x.Gender, x.TotalCount });
    foreach (var item in q)
        Console.WriteLine(item);
}
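Anonymous types can be used as composite keys in other query clauses as well, such as join. (They do not implement IComparable, so for orderby it is better to list several comma-separated keys than to order by an anonymous type.) We can also use them to add multi-argument support to our memoize function, where they serve as multi-argument keys in the map contained in the memoization function.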
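Here is a minimal sketch of a composite key in a join clause. It pairs up entries for the same name and gender across two years. The NameCount class, the CompareYears method, and the sample numbers are all made up for illustration and are not taken from the dataset.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down record used only for this example.
public class NameCount
{
    public string Name { get; set; }
    public string Gender { get; set; }
    public int Year { get; set; }
    public int Count { get; set; }
}

public static class CompositeKeyJoinDemo
{
    // Matches entries from two years on the composite key (Name, Gender).
    public static List<string> CompareYears(
        IEnumerable<NameCount> names, int firstYear, int secondYear)
    {
        var q = from a in names
                where a.Year == firstYear
                join b in names.Where(n => n.Year == secondYear)
                    on new { a.Name, a.Gender } equals new { b.Name, b.Gender }
                orderby a.Name, a.Gender
                select string.Format("{0} ({1}): {2} -> {3}",
                                     a.Name, a.Gender, a.Count, b.Count);
        return q.ToList();
    }

    public static void Main()
    {
        // Invented sample data.
        var names = new List<NameCount>
        {
            new NameCount { Name = "Terry", Gender = "Male", Year = 1960, Count = 10 },
            new NameCount { Name = "Terry", Gender = "Female", Year = 1960, Count = 4 },
            new NameCount { Name = "Pam", Gender = "Female", Year = 1960, Count = 7 },
            new NameCount { Name = "Terry", Gender = "Male", Year = 2005, Count = 3 }
        };
        foreach (var line in CompareYears(names, 1960, 2005))
            Console.WriteLine(line); // Terry (Male): 10 -> 3
    }
}
```

Both key expressions must produce the same anonymous type, so the property names, order, and types have to line up on each side of equals.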
public static Func<A, B, R> Memoize<A, B, R>(this Func<A, B, R> f)
{
    var map = new Dictionary<???,R>();
    return (a, b) =>
    {
        var tuple = new { a, b };
        R value;
        if (map.TryGetValue(tuple, out value))
            return value;
        value = f(a, b);
        map.Add(tuple, value);
        return value;
    };
}
Memoize takes a function of two arguments of types A and B respectively, and returns a function with the same signature. The difference is that when the two arguments are passed into the lambda, they are first put in a tuple, and the lambda checks whether that tuple is already in the map before calling f. Essentially, we are using the anonymous type to form a composite key out of the two arguments passed to the lambda.
Mumbling
But how can we create a Dictionary from an anonymous type to type R? Declaring the variable is easy: the contextual keyword var lets us declare map even though its type has no speakable name. What isn't obvious is how to specify the type arguments to the Dictionary constructor when one of them is an anonymous type.
We can get around this problem by introducing a new helper class.
static class DictionaryHelper<Value>
{
    public static Dictionary<Key, Value> Create<Key>(Key prototype)
    {
        return new Dictionary<Key, Value>();
    }
}
Here we put the type parameters that we can name (Value in this case) on the helper class. Then we create a static method in the helper class that takes the remaining type parameters (Key in this case) and also takes one parameter of the same type for each of those type parameters. The arguments passed for these parameters are used by type inference to infer the unspeakable type parameters, so we do not need to specify those types ourselves.
This means that we can replace the map creation in Memoize with the following code.
var map = DictionaryHelper<R>.Create(new { a = default(A), b = default(B) });
We specify one of the type parameters (R) of the Dictionary explicitly, but we specify the other (an anonymous type) implicitly by providing an example of the type.
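Putting the pieces together, a complete version might look like the sketch below. The MemoizeExtensions and MemoizeDemo class names and the call-counting demo are assumptions added for illustration. The sketch is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

static class DictionaryHelper<Value>
{
    // The prototype argument exists only so that Key can be inferred.
    public static Dictionary<Key, Value> Create<Key>(Key prototype)
    {
        return new Dictionary<Key, Value>();
    }
}

public static class MemoizeExtensions
{
    public static Func<A, B, R> Memoize<A, B, R>(this Func<A, B, R> f)
    {
        // Same property names, order, and types as the key built in the lambda.
        var map = DictionaryHelper<R>.Create(new { a = default(A), b = default(B) });
        return (a, b) =>
        {
            var tuple = new { a, b };
            R value;
            if (map.TryGetValue(tuple, out value))
                return value;
            value = f(a, b);
            map.Add(tuple, value);
            return value;
        };
    }
}

public static class MemoizeDemo
{
    public static void Main()
    {
        int calls = 0;
        Func<int, int, int> add = (x, y) => { calls++; return x + y; };
        Func<int, int, int> fast = add.Memoize();

        fast(2, 3); // computed
        fast(2, 3); // served from the map
        fast(3, 2); // different key, computed
        Console.WriteLine(calls); // 2
    }
}
```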
I love using anonymous types as composite keys because they define equality and hashing semantics in terms of their members. So next time you need a composite key, try using anonymous types.
In any case, now that my wife and I are expecting our third child, I have been writing a number of queries against this dataset to understand the ebb and flow of baby names.
Comments
Anonymous
February 10, 2007
Kind of related... kind of. I'm guessing there are now some serious hundreds of thousands of lines of C# 3 written internally at Microsoft. Out of this body of experience, has any stylistic good practice emerged yet for the use of the 'var' keyword? Generally speaking, should one use it anywhere one can? Or should one take the "only where absolutely necessary" approach? Since I've been spending time playing with F#, I'm tending towards the "wherever possible" end of the spectrum. What are the thoughts of the C# team?
Anonymous
February 12, 2007
I am not sure what other teams consider their best practice with respect to 'var' usage, but I love using var. In fact, I tend to use it unless there is a reason not to use it. But I also don't tend to think it takes away from the readability of the code since I think it puts the focus on what the variable is as opposed to its type. Also, if you have tooling support like Visual Studio then you can hover over var and it will show you the type of the variable, and of course IntelliSense will give you the members of the variable. Furthermore, the compiler will give you error messages if you have a type checking error. So I really like it and use it a lot. Does anyone else have a different opinion?
Anonymous
February 13, 2007
Thanks for your excellent explanations. But one question does arise: why not have a composite TYPE constructor in the language? Then you could create a dictionary like this:
var map = new Dictionary<{A,B},R>();
or
var map = new Dictionary<<A,B>,R>();
It is good to have anonymous types, but it would be even better to have a way of naming them John or Jane Doe.
Anonymous
February 13, 2007
That is the natural question, isn't it? Probably the number one request we have been getting with respect to anonymous types is, ironically, the ability to name them. The solution that you propose is nice, but it does leave out some issues: two anonymous types are the same type within an assembly if they have the same property names in the same order with the same types. So really we would need something like:
var map = new Dictionary<{A a, B b},R>();
That way both the type and the name of each property are captured, since both form part of the type signature.
Anonymous
February 13, 2007
When I did see this blogpost : Baby Names, Nameless Keys, and Mumbling , I also decided to "scrape" the
Anonymous
February 19, 2007
Welcome to the twenty-first Community Convergence. I'm Charlie Calvert, the C# Community PM, and this
Anonymous
February 21, 2007
I'm going to second the request for named structs. It would be very helpful to just have a shorter syntax for defining structs. new Record{Name = "John"; Age = new int?(23)} would implicitly define the following: public struct Record { private _name; // etc } and... new {Name = "John", Age = new int?(23)} != Record{Name = "John"; Age = new int?(23)}
The reason is that I'm trying to generate an assembly using CodeDOM that instantiates anonymous types defined in the assembly that generates it. What I'm trying to do is add pattern matching to C#.
First question: Is this possible with anonymous types at all? I assume they are private, but is there any special permissions trickery I could use to do it? After all, an assembly containing private types should be able to give special permission to use them to an assembly it creates, no?
Second question (not necessarily related to LINQ, but related to the first question): Is it even possible to generate an assembly which references the assembly that generates it? When I attempt to do so I get a weird error saying that it can't find some temporary dll file. I'm including a reference to the current assembly in the ReferencedAssemblies collection of the CompilerParameters object by passing it thisAssembly.Location, which appears to contain the correct filename for the dll.
If this is too off-topic, perhaps you could direct me to a CodeDOM guru that could help me?
Anonymous
February 21, 2007
It is very interesting that you want a shorter syntax for defining structs. At one point we were looking very seriously at interface initializers. So if you had a "data" interface (no methods or events) that looked something like this: interface IFoo { int X { get; set; } string Y { get; } } then you could implicitly create an implementation of this interface by doing the following: new IFoo { X = 5, Y = "bar" } These classes would have Equals, GetHashCode, and ToString defined on them. There were a few unresolved issues so we haven't done them yet, but I personally would absolutely love to get something like this in. I don't think we ever talked about doing something with structs specifically.
1st Question: You should be able to instantiate instances of the anonymous types. Yes, they are internal to the assembly and they look like just another type except that their names are mangled. You could use friend assemblies or reflection to do it. With respect to CodeDOM, I am not an expert there so I will ask around and get back to you.
2nd Question: You've got me here, I don't have a clue. It seems like you should be able to. I will find an expert for you.
Anonymous
February 26, 2007
Jafar: Ok, I got into contact with the CodeDOM guys. Send me an email with your question and I'll loop in the correct crowd.
Anonymous
February 27, 2007
Wes, thank you for the article. So what was the performance of Linq on large object collections? Any stats? Regards, Marcel
Anonymous
February 27, 2007
Marcel: Great question. Linq performs very well on large object collections. It is typically within 5% of the performance of the equivalent code without Linq. That is fantastic since the code is much, much simpler and easier to maintain. Furthermore, with providers that are coming out like PLinq, we will soon be able to implicitly use concurrency in Linq to Objects-like queries, which will boost the performance.
Anonymous
February 27, 2007
Wes, Any chance of a status update on PLINQ from either yourself or Joe Duffy? PLINQ is for my shop (and I'm sure others) a huge part of the LINQ value statement. Is PLINQ essentially a Visual Studio Miami feature? Kind regards, tom
Anonymous
February 27, 2007
I'll get more definitive information from Joe but until then... As far as I know, PLinq is planning on shipping with Orcas+1. I know that concurrency and PLinq are very important to my team (C# language, compiler, IDE). Hope that helps.
Anonymous
February 27, 2007
Cheers Wes. Joe's kind of 'gone dark' on PLINQ recently; I know it's very early days given that LINQ itself hasn't shipped. LINQ and FP have really gotten me thinking about pure functions, side effects and whatnot in my C# coding. It'll be such a value add when we can take bits of existing LINQ to Objects code and have it executed concurrently. As soon as there's even a sniff of a PLINQ TAP or EAP program, can you let us know? Your customers have a hunger for this stuff :-)
Anonymous
February 27, 2007
I will let you know as soon as I know. If you have been thinking about side effects and pure functions then I think you are going to like my next post.
Anonymous
March 16, 2007
The comment has been removed
Anonymous
April 05, 2007
Overview In the spirit of keeping my first post short and simple, i plan to write about changes to Anonymous
Anonymous
May 25, 2007
Loved the post. Very interesting, you have inspired me to do a similar activity.
Anonymous
November 24, 2007
We are so impressed by the stats, we have decided to name our next offspring after you.
Anonymous
March 15, 2008
We are more the victims of our time than we imagine. Selecting what we think are unique baby names only to discover that we are running with the crowd is shocking, but it is also happening with our diets and health. We are ingesting the same toxins and getting hit by the same epidemic events, like the rapid rise in diabetes. We are the victims of corporate planning - yes, indeed, a conspiracy. Don't say that too loud or you will be put away in a padded cell. The companies that produce the fatty foods also have a weight loss division. It is simply profitable, and we are the unthinking raw material for corporate success stories. Sleep well, now.