Dealing with Terabytes of Data in F#
In one of our current projects, our algorithms have to process close to 1 TB (terabyte) of raw (ASCII) logs. Fortunately, the only analysis we need is a single pass through all the data, collecting a small number of statistics per log line (for example, counting the number of log lines that satisfy a certain criterion).
With a dataset of this size, it is out of the question to read everything into memory before processing it line by line. The central .NET/F# data structure we are using is IEnumerable, a memory-efficient and lazy way of enumerating through collections of any type. Here is a short piece of F# code that provides an IEnumerable over all log lines (using the new generate_using function that Don put into the standard library after my posting):
#light
open System.IO
open System.Collections.Generic
/// Creates an IEnumerable over the lines of any text file.
/// The function does not check whether the file exists!
let CreateDataStream (fileName:string) =
    IEnumerable.generate_using
        ( fun () -> new StreamReader (fileName) )
        ( fun reader -> if (reader.EndOfStream) then None else Some (reader.ReadLine()) )
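For readers on a current F# release, where generate_using is no longer part of the library, the same idea can be sketched with a sequence expression. This is only an illustration under stated assumptions: the names readLines and countMatching, the temporary file, and the "ERROR" criterion are made up for the example and are not part of our project code.

```fsharp
open System.IO

/// Lazily yields the lines of a text file, one at a time.
/// The reader is opened when enumeration starts and disposed when it ends.
let readLines (fileName: string) : seq<string> =
    seq {
        use reader = new StreamReader(fileName)
        while not reader.EndOfStream do
            yield reader.ReadLine()
    }

/// Counts the lines that satisfy a predicate in a single pass.
let countMatching (predicate: string -> bool) (lines: seq<string>) : int64 =
    lines
    |> Seq.fold (fun count line -> if predicate line then count + 1L else count) 0L

// Hypothetical usage: a tiny temporary file stands in for the real logs.
let path = Path.GetTempFileName()
File.WriteAllLines(path, [| "INFO start"; "ERROR disk"; "INFO ok"; "ERROR net" |])

let errorCount =
    readLines path |> countMatching (fun line -> line.StartsWith("ERROR"))

printfn "errors = %d" errorCount
File.Delete path
```

Only one line is held in memory at a time, so the same pipeline should behave identically whether the file has four lines or several billion.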
However, during development one often wants to run and test the code without waiting for hours until the full terabyte is processed, only to find an off-by-one error in the counting. Of course, one could write a little helper tool that copies the first, say, 10 megabytes of the full data file, and then process this much smaller file during development. However, this is inelegant and replicates the same data on the file system. A much better way is to use this short function, truncate:
module IEnumerable = begin
    /// Truncates a given IEnumerable
    let truncate n (x: #IEnumerable<'a>) =
        IEnumerable.generate
            ( fun () -> ref 0,x.GetEnumerator() )
            ( fun (i,ie) -> if !i >= n or not (ie.MoveNext()) then None else (incr i; Some(ie.Current)) )
            ( fun (_,ie) -> ie.Dispose () )
end
The nice thing about this truncation is that it has practically no computational overhead (other than testing and incrementing an integer) and needs no temporary memory. Here is a short piece of test code for this function:
/// Test the truncate.
do [| 0;1;2;3;4;5;6;7;8;9 |] |> IEnumerable.truncate 4 |> IEnumerable.iter (printf "i = %d\n")
do read_line () |> ignore
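In current versions of F#, the core library ships Seq.truncate, which plays the same role as the hand-written truncate. The following is a minimal sketch, not our project code: the counter pulled and the synthetic source are assumptions introduced only to make the laziness visible.

```fsharp
// Counts how many elements the pipeline actually pulls from the source.
let pulled = ref 0

// A large lazy source standing in for the log-line stream (illustrative only).
let source =
    seq {
        for i in 0 .. 999999 do
            pulled.Value <- pulled.Value + 1
            yield i
    }

// Keep at most the first four elements; nothing is copied or buffered up front.
let firstFour = source |> Seq.truncate 4 |> Seq.toList

firstFour |> List.iter (printfn "i = %d")
printfn "elements pulled from the source: %d" pulled.Value
```

The counter should stay tiny compared with the million elements the source could produce, which is the property that makes truncation attractive for quick development runs on the full data file.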
Ralf Herbrich
P.S.: Thanks to Don Syme and James Margetson for helping us with the truncate function!!!
Comments
Anonymous
January 01, 2003
Ralf, Phil and Thore in the MSR Cambridge Applied Games Group have been continuing their work using F#
Anonymous
January 01, 2003
Cross posted from http://blogs.msdn.com/dsyme/ Ralf, Phil and Thore in the MSR Cambridge Applied...