Architecture Design Question of the Week

Here is the scenario...

Every day you get a text file with a number of financial records in a fixed length format. Your job is to identify any new or changed records from the previous day and be able to produce an output file in the same format containing only the new or changed records. You can make no assumptions about the ordering of records in the file.

Here is a sample of the data

1310|512|086048610|01/01/1996|WB| |12/31/9999|1290.00 |USD5 |

1310|512|110000011|06/10/2002|WB| |12/31/9999|100.00 |USD5 |

1310|512|110000111|06/10/2002|WB| |12/31/9999|100.00 |USD5 |

 

The data files can get quite large (3GB)

Question for you: How would you architect the solution for this to achieve optimal performance?

Comments

  • Anonymous
    September 25, 2006
    In practicing architecture...

    My first question would be: Which is the business problem that needs to be solved here?

    Then: Why do I need to spot the new or changed records?

    Another: What task is accomplished with these financial records in the data file?

  • Anonymous
    September 25, 2006
    Good job Marcod - you win the first design prize.  Good architects are always looking at the larger context because it shapes the solution in important ways.  Where this data comes from and where it is going are two questions that I don't know the answer to (yet) but hope to have something soon.  For example it might be advantageous to load this data into a SQL database and allow it to do the heavy lifting of searching and sorting.  On the other hand it might not be worth the extra time and effort.  Hard to say without knowing more.  Stay tuned.

  • Anonymous
    September 25, 2006
    The comment has been removed

  • Anonymous
    September 26, 2006
    The comment has been removed

  • Anonymous
    September 26, 2006
    So, is there a memory constraint?  Would I be able to consume 6GB of RAM to process these files?

    Also, are we able to change the application that produces these files to add a timestamp to each record so we don't even have to look at the previous day's file?

  • Anonymous
    September 26, 2006
    I like the idea of timestamps.  I don't know if we can alter the thing that generates the files though but would most certainly be on my list of asks. The current system has been run on 2 servers.  One has 3GB of RAM and one has 1GB of RAM.  Though considering the cost of hardware I would not hesitate to recommend a hardware upgrade for the solution.

  • Anonymous
    September 26, 2006
    The comment has been removed

  • Anonymous
    September 26, 2006
    Oooh - more interesting approaches and comments since I started on mine. From reading them, it occurs to me that it might be worth me showing my system specs so the timings mean something:

    Athlon 1600
    512MB RAM (+10GB swap)
    Single 80GB 7200RPM HD

    Also, I forgot to point out that only commands 3 and 8 are needed in the final script.

  • Anonymous
    September 26, 2006
    The comment has been removed

  • Anonymous
    September 27, 2006
    The comment has been removed

  • Anonymous
    September 27, 2006
    The comment has been removed

  • Anonymous
    September 28, 2006
    PingBack from http://asailorskis.com/arch/?p=61

  • Anonymous
    October 17, 2006
    The way I would do it owuld be: I would create two tables with the same columns (the columns would be the same as in the text fle). Lets Say Table1 and Table2 You could use either Access or SqlServer(Firebird, PostgreSql,Sqlite etc) I would create an odbc datasource for the text file (for SqlServer) Or a simple procedure with TransferText (for Access). Lets sa Every day I would delete the data from table 2, and insert it from the text file via the odbc connection for SqlServer or via the procedure in Access. The data from the previuos day would allready be in Table1 Then I would apply aquery like Select Table1.Field1,Table1.Field2 ... Table2.Field1, Table2.Field2 from Table1 Right Outer Join Table2 on Table1.Field1=Table2.Field1,Table1.Field2=Table2.Field2 etc These would give you all the records from the second table (today), and for the records that are not present in the table1 table (yesterday) you would have all nulls. So You can take the new records. The output file could be generated through another odbc link for SQLServer or through a procedure from Access. After this you could replace the data in table1 with the data in Table2 for tomorrow

  • Anonymous
    October 18, 2006
    My Solution: Some things to consider.

  • The fastest time can be achieved if the number of reads from files are reduced.
  • Important thing to note: at least one of the files can be read sequentially and only once. The second might need rereading previously read records from disk depending on the amount of available memory.
  • we need to split our files into pages. Pages are of fixed size and can be read whole, also they can be read directly by simply seeking calculated page position in the file. Every page contains a fixed number of records.
  • we need to determine the maximum number of pages we can store in memory.
  • the idea is to read the firs file sequentially in small chunks discarding previous read data and also read the second one sequentially but also index every record found and keep as much as possible in memory.
  • We also need a list containing pages with records read from the second file, used as a MRU (most recent used) list Algorithm:  - Create and indexing structure contains records key and page where the record can be found. The index can be created with the appropriate size from the beginning and no dynamic resizing will be required.  - Start reading the first file a small chunk at a time. Now we need to try to find a matching record from the second file for every record in our current chunk read from the first file.    IF record number not in INDEX then        - Read the second file a page at a time until the record is found. The difference is that previous pages are not discarded. Every new page read is pushed into the MRU page list. When the page is read every record is indexed: it's key and page are inserted into index. If the maximum size of the list is reached the last page is discarded.    ELSE       IF Page already in memory            - Compare records and bring page to the beginning of the MRU list. If no more records not compared in current page discard it       ELSE Read page from file2 and discard last page from MRU list if exceeding max   size. Scenarios: 1 Worst case:
  • every record from every page in the fist file has the equivalent one, in a different page on the second one, so basically at worst you won't have page hits until all remaining pages cam be stored in memory.
  • best case scenario all equivalent records are on the same pages which means every file is read only once and the records compared. Boy I sure hope my English is good enough for somebody to understand what I was trying to say. :)