Just in case any one needs a Observable Collection that deals with large data sets, and supports FULL EDITING...
Virtualizing Data with XAML
Overview
The Observer design pattern has been is one of the lynchpins of almost every modern software platform for years. In fact Observer is the core of the MVC/MVVM design patterns. In .NET, ObservableCollection is one of the most frequently-used objects in any system that involves data binding (which is pretty much any non-trivial application).
As great as ObservableCollection is, it does have some limitations:
- It doesn’t implement any kind of data virtualization, making large or unbound data sets challenging to work with.
There are some frameworks and tools out there that provide some form of virtualized observable collection, but typically these come with limitations like only working with controls from the same framework, or requiring certain coding paradigms to function properly. They’re not a general-use solution. Plus, as far as we know there is no virtualized observable collection that actually supports virtualized read/write, but rather they use a “read and reload” pattern which can be very inefficient with large data sets.
So, what to do about it? We made one ourselves – the VirtualizingObservableCollection, which does the following:
- Implements the same interfaces and methods as ObsevableCollection<T> so you can use it anywhere you’d use an ObservableCollection<T> – no need to change any of your existing controls.
- Supports true multi-user read/write without resets (maximizing performance for large-scale concurrency scenarios).
- Manages memory on its own so it never runs out of memory, no matter how large the data set is (especially important for mobile devices).
- Natively works asynchronously – great for slow network connections and occasionally-connected models.
- Works great out of the box, but is flexible and extendable enough to customize for your needs.
- Has a data access performance curve so good it’s just as fast as the regular ObservableCollection – the cost of using it is negligible.
- Works in any .NET project because it’s implemented in a Portable Code Library (PCL).
Interested? Read on.
First, a bit about ObservableCollection
We have all seen the myriad of examples of using ObservableCollections with ItemControls – ListBoxes and ComboBoxes for example, such as:
<ListBox x:Name="lb" ItemsSource=”{Binding Path=MyData}” Grid.Row="0">
<ListBox.ItemTemplate>
<DataTemplate>
<TextBlock Text="{Binding Path=Name}" />
</DataTemplate>
</ListBox.ItemTemplate>
</ListBox>
Where the ViewModel contains something like:
public class SimpleViewModel
{
private ObservableCollection<SimpleDataItem> _MyData = null;
public ObservableCollection<SimpleDataItem> MyData
{
get
{
if(_MyData == null)
{
_MyData = new ObservableCollection<SimpleDataItem>();
foreach (var o in SimpleDataSource.Instance.Items)
{
_MyData.Add(o);
}
}
return _MyData;
}
}
}
You just fill the data up in the ObservableCollection from the data source, and off you go. Perfect, right?
Well, not so fast. It’s a great solution if the ObservableCollection is a manageable size, but in most large scale systems the data can be thousands or even hundreds of thousands of rows. Loading that volume of data will seriously slow down the application. If the call is to a remote server then it’s even worse – that simple load, can take minutes, or even hours. Mobile devices, with limited CPU, memory and bandwidth compound the problem even further.
Clearly this is not a workable solution for these cases (which pretty well every one of our projects seems to have).
So how are people dealing with big data sets today? Typically we see a few common solutions:
- Implement a ‘page next/previous’ system in the UI. So, this way we only load the number of items in the list. This is the most common large data UI – but its seriously flawed when the data we want it on page 231 – the user has to press the ‘next’ button 231 times.
- Limit the number of rows returned. Well that’s not very usable, and how do we tell the user that the list limit has been reached.
- Extending 2, we could limit the list, but then have a “load more” option at the end. This is better, and is a UI paradigm that we see frequently on mobile devices. But again, ifs the data is on page 250, users will have to press “load more” 249 times.
Those are three the most common solutions used today. There is a fourth, which is a degree more sophisticated:
4. Create a pagination system inside the collection, so it appears to be the full data set, but it loads chunks (or pages) at a time as needed in the background. That way you have the same UI whether you have a hundred data items or a hundred million.
The problem this approach is that to date, the available implemented solutions are read-only (except the framework providers such as Telerik that do support paging and CRUD (Create, Read, Update, Delete) operations in their own special ways. But these but typically only on their own controls.
If we want to update the collection with changes in the data set, we effectively have to reset/refresh the entire list and fetch the data again. Every paginating collection we’ve found follows this same pattern (see the end of this article for more details about that).
If there’s only one user, or the data doesn’t change very often, this may be okay. However, in multi-user and/or high-data-frequency scenarios, where lots of users are making changes to the underlying data store (like a SQL database, for example) or the data is changing very often (like a near-realtime monitoring system), this becomes a serious problem because the only way to ensure you have up-to-date data is to reload, causing your visuals to flicker as they’re reloaded, scrolling lists back up to the top, and losing your current selection, not to mention the cost of needing to get all the pages of data all over again.
This is less than ideal solution. You can get around this by using Behaviors that intercept the “reset” and do tricky things with the visuals, scroll position and selection to give the impression of seamlessness, but these are ugly hacks we’ve all used for years. What would be better would be a solution to the underlying problem.
That’s what VirtualizingObservableCollection is – an ObservableCollection that paginates, and that you can update just like an ordinary ObservableCollection. That’s a real solution to this problem.
Where to get it
The latest package can be found on nugget. Install-Package VirtualizingObservableCollection. We will be putting unp the source for it shortly, as well as a comprehensive example.
How to use it
VirtualizingObservableCollection is implemented in a PCL so you can use it in all kinds of .NET solutions. Consequently, some of the Async work it does in the background needs to be able to call back to the dispatcher thread. Also the, page reclamation system needs to run on a periodic basis, so you need to set up some sort of timer to set those up.
Setting up the VirtualizationManager
Both of these are platform-specific needs, so you’ll need to write (a tiny bit) of boilerplate code for the VirtualizationManager to use. However you only needs to do it once.
Here’s is an example in WPF:
public MainWindow()
{
InitializeComponent();
//this routine only needs to run once, so first check to make sure the
//VirtualizationManager isn’t already initialized
if (!VirtualizationManager.IsInitialized)
{
//set the VirtualizationManager’s UIThreadExcecuteAction. In this case
//we’re using Dispatcher.Invoke to give the VirtualizationManager access
//to the dispatcher thread, and using a DispatcherTimer to run the background
//operations the VirtualizationManager needs to run to reclaim pages and manage memory.
VirtualizationManager.Instance.UIThreadExcecuteAction =
(a) => Dispatcher.Invoke(a);
new DispatcherTimer(
TimeSpan.FromSeconds(1),
DispatcherPriority.Background,
delegate(object s, EventArgs a)
{
VirtualizationManager.Instance.ProcessActions();
},
this.Dispatcher).Start();
}
//the rest of the constructor goes here
}
Now we have set that up, we need to create a service that fills up the data. We don’t want to do fetch all the data all at once, so we’ll use something more like a callback service, called by our virtualizing collection as needed. We do this by implementing IPagedSourceProvider<T> interface.
Here is the same thing for Xamarin.Forms
public class App
{
public static Page GetMainPage()
{
if(!VirtualizationManager.IsInitialized)
{
VirtualizationManager.Instance.UIThreadExcecuteAction = (a) => Device.BeginInvokeOnMainThread(a);
Device.StartTimer(TimeSpan.FromSeconds(1), () => {
VirtualizationManager.Instance.ProcessActions();
return true;
});
}
return new ContentPage
{
Content = new Label
{
Text = "Hello, Forms !",
VerticalOptions = LayoutOptions.CenterAndExpand,
HorizontalOptions = LayoutOptions.CenterAndExpand,
},
};
}
}
Implementing a provider
IPagedSourceProvider<T> defines three methods (two of which are optional) and one property:
- public PagedSourceItemsPacket<T> GetItemsAt(int pageoffset, int count, bool usePlaceholder)
This returns a list of items (in the IEnumerable<T> Items property) starting at item pageoffset, for a maximum of count items (ignore the usePlaceholder argument, that’s used internally by the system).
BIG NOTE – The pageoffset is the number of items into the Enum NOT the page number, this is to allow pages to be different sizes. See the how it works section.
- public int IndexOf(SimpleDataItem item)
This returns the index of a specific item. This method is optional – you can just return -1 if you don’t need to use IndexOf. It’s not strictly required if don’t need to be able to seeking to a specific item, but if you are selecting items implementing this method is recommended. - public void OnReset(int count)
This is a callback that runs when a Reset is called on a provider. Implementing this is also optional. If you don’t need to do anything in particular when resets occur, you can leavt this method body empty. - public int Count/
This is an integer with the total number of items in the data set.
So, let’s look at our super simple example:
public class TesterSource : IPagedSourceProvider<SimpleDataItem>
{
public PagedSourceItemsPacket<SimpleDataItem> GetItemsAt(int pageoffset, int count, bool usePlaceholder)
{
return new PagedSourceItemsPacket<SimpleDataItem>()
{ LoadedAt = DateTime.Now,
Items = (from items in SimpleDataSource.Instance.Items select items).Skip(pageoffset).Take(count)
};
}
public int Count
{
get { return SimpleDataSource.Instance.Items.Count; }
}
public int IndexOf(SimpleDataItem item)
{
return SimpleDataSource.Instance.Items.IndexOf(item);
}
public void OnReset(int count)
{
}
}
And Finally – using the provider in the VirtualizingObservableCollection
Once we have our IPagedSourceProvider set up, we can create a VirtualizingObservableCollection and use it just like we do a normal ObservableCollection:
public class SimpleViewModel
{
private VirtualizingObservableCollection<SimpleDataItem> _MyDataVirtualized = null;
public VirtualizingObservableCollection<SimpleDataItem> MyDataVirtualized
{
get
{
if (_MyDataVirtualized == null)
{
_MyDataVirtualized =
new VirtualizingObservableCollection<SimpleDataItem>(
new PaginationManager<SimpleDataItem>(new TesterSource()));
}
return _MyDataVirtualized;
}
}
}
As you can see from the above example, we are actually creating a few objects. Let’s unwrap it to see how they’re related:
- The IPagedSourceProvider<T> (TesterSource) accesses the underlying data source, providing essential information about the data source like the number of items.
- The PaginationManager<T> wraps the IPagedSourceProvider<T> and manages it. Most of the heavy lifting for pagination, memory management and data insertion is done here.
- The VirtualizingObservableCollection<T> wraps the PaginationManager<T> and can be used anywhere you can use an ObservableCollection<T>.
The PaginationManager constructor has a number of optional parameters to make it more flexible:
public PaginationManager(
IPagedSourceProvider<T> provider,
IPageReclaimer<T> reclaimer = null,
IPageExpiryComparer expiryComparer = null,
int pageSize = 100,
int maxPages = 100,
int maxDeltas = -1,
int maxDistance = -1,
string sectionContext = ""
)
- IPageReclaimer<T> reclaimer
The default implementation uses PageReclaimOnTouched<T> which removes pages based on the last time they were accessed. However, there may be times when you want to use a different strategy for reclaiming pages. In those cases, you can override the IPageReclaimer<T> with one of your own. - IPageExpiryComparer expiryComparer
This is a really useful system that allows you to check the update stamp of the page v.s. the update stamp of the update itself (so it does not reapply the update to a page that has already got that update). There is an implementation using DateTime in DateBasedPageExpiryComparer.
- int pageSize
The default size of a page is 100 items. You can change that to any positive integer you like. - int maxPages
By default, the VirtualizingObservableCollection will begin to reclaim pages once there are 100 pages in memory. You can change this to any positive integer you like. - int maxDelta
The default is -1, which means the VirtualizingObservableCollection will not release once a delta limit is exceeded. A forthcoming post about how the VirtualizingObservableCollection works under the hood will detail this argument further. - int maxDistance
The default is -1, which means the VirtualizingObservableCollection does take distance into account. A forthcoming post about how the VirtualizingObservableCollection works under the hood will detail this argument further. - string sectionContext
The default is an empty string. A forthcoming post about how the VirtualizingObservableCollection works under the hood will detail what this argument does.
Once we have our collection we can quite happily do anything we want to, for example:
public void InsertAt(int index, SimpleDataItem newItem)
{
SimpleDataSource.Instance.Items.Insert(index, newItem);
this.MyDataVirtualized.Insert(index, newItem);
}
public void RemoveAt(int index)
{
SimpleDataSource.Instance.Items.RemoveAt(index);
this.MyDataVirtualized.RemoveAt(index);
}
Yes, that’s an actual write operation on a data-virtualized observable collection, and once the edit is performed, whether done locally, via some messaging system such SignalR, or when someone else has modified the underlying data source, the VirtualizedObservableCollection won’t reset, so you won’t have to worry about hiding flicker, re-scrolling or caching selections.
Using UpdateAt when you are sharing a database
When you are using latent calls, and updates from other clients, such as a SQL database, it’s a great idea to check if the update should be applied to your copy of the virtual collection. This is done at a page level (hence the LoadedAt member on the PagedSourceItemsPacket – simply return DateTime.Now, or even better store the TimeStamp in the row in the database and return LoadedAt = (from I in Items select TimeStamp).Max();.
Then, pass in the Expiry implementation you wish to use, for example: new
VirtualizingObservableCollection<SimpleDataItem>(new PaginationManager<SimpleDataItem>(new TesterSource(), expiryComparer: DateBasedPageExpiryComparer.DefaultInstance)**);
To aid with this, when you get an update from another client (for example from SignalR), use the extended CRUD operators e.g. InsertAt(int index, T newItem, object UpdatedAt) where UpdatedAt is the DateTime of the insert operation. That way, your copy will always be in perfect sync with the database
Understanding how it works
Before we can really understand how it works, we need to understand how controls such as ListBox call the interfaces.
When you have a control that has an ItemsSource: Typically it uses IList<T> (specifically the T this[in index] getter to populate the items, with an occational IndexOf thrown in. By using a sparse array of pages, we only cache a limited number of elements, and can release them if the page cache fills up. That way, the collection can look like its any size – but we keep a smaller subset of data as its traversed over. This is the Dictionary<int page, ISourcePage<T> _Pages collection in the page manager.
These pages that the PagingManager stores, are by default, one is ‘pageSize’ long (default is 100). This is typically how all the read only virtualizing collections do it. What we realized was that we needed a way of being able to change the page size for pages that have had CRUD operations applied. BUT – we needed an effient method to work out what page and offset into the page a given index actually is contained at. This is the secret to this implementation – we use a combination of recursive seeks to get the page and offset rather than having to work out which page by summing the page length of EVERY page that is before it. We have also introduced optimized paths where the control is asking for the very next item, or the current item.
Lets explain this a little more:-
We keep a Dictionary<int page, PageDelta> _Delta
This means that we can compute a logical adjustment as in var adjustment = (from deltas in _Deltas.Values where deltas.PageNumber<page select deltas.Delta).Sum()
So, basically we are only storing the pages that are non-conforming in size.
Now:- it’s not quite as simple as applying (page*pageSize)+adjustment Depending on if the adjustment is positive or negative we must recurse if the adjustment is positive, taking into account any adjustments for the pages it effectively offsets into. This is a small price on performance, but these initial seeks are infrequent compared with single increment seeks (basically it goes [100] [101] etc), which are optimized so they do not require recursion, or access to any of the previous deltas.
If you want to understand the specifics, the seeking is all contained in the CculateFromIndex method in the PaginationManager class.
We should also mention the IPageReclaimer<T> interface, and more specifically the class PageReclaimOnTouched which is the default implementation used for the PaginationManager. On a periodic basis, if the total number of pages held in the cache exceeds the maxPages limit, then it will attempt to remove and release the overflow pages by removing those pages in the order where LAST accessed pages are oldest.
So, we get the best of both worlds – huge collections that load fast, efficient memory model, small data packets (pages), and the ability to edit the observable collection.
Working Async:
If you want to implement it using a (mostly) Async pattern – Simply implement IPagedSourceProviderAsync which introduces three new methods:
Task<PagedSourceItemsPacket<SimpleDataItem>> – This is basically the same as GetItemsAt, with the same parameters – only its designed to run (and be called) async.
Task<int> GetCountAsync() – This is basically the same as the Count property, only in this case the count is got Async.
GetPlaceHolder(int page, int offset) – GetPlaceholder is used to return an object that is used in place of the real data while the paged data is still being fetched.
So: for our example we can implement something like:
public class TesterSourceAsync : TesterSource, IPagedSourceProviderAsync<SimpleDataItem>
{
public async Task<PagedSourceItemsPacket<SimpleDataItem>> GetItemsAtAsync(int pageoffset, int count, bool usePlaceholder)
{
await Task.Delay(1000);
return new PagedSourceItemsPacket<SimpleDataItem>() { LoadedAt = DateTime.Now, Items = (from items in SimpleDataSource.Instance.Items select items).Skip(pageoffset).Take(count) };
}
public async Task<int> GetCountAsync()
{
await Task.Delay(1000);
return SimpleDataSource.Instance.Items.Count;
}
public SimpleDataItem GetPlaceHolder(int index, int page, int offset)
{
return new SimpleDataItem() { Name = "Waiting [" + page + "/" +offset + "]" };
}
}
When does it use async, and when does it drop back to non-async:-
All normal get operations used by ItemSources will use the Async versions, so it reports an initial count of zero, then once the GetCountAsync completes it issues a reset, with the updated count. As data items are requested, it puts in placeholders for the items, and when the GetItemsAtAsync completes, it replaces those items with the real data values.
It falls back to the non-async calls in the following conditions:
- When you get an enumerator- so if you issue a ‘foreach(var item in myVOC) ‘ so you get the real count and the real data (instead of placeholders).
- When you issue a ‘write’ operation and the data is not paged in, in this case it uses the non async GetItemsAt (so it puts the item in correctly).