StreamInsight for Non-Programmers
Credits
[Roman Schindlauer and J Sawyer contributed to this article]
Introduction
Microsoft StreamInsight consists of a set of programming tools, and most of what is written about StreamInsight is written specifically for programmers.
But what if you are, for example, a database administrator or data analyst without an extensive programming background? You're wondering if StreamInsight might be a solution for a problem you have, but the existing documentation leaves you scratching your head.
Hopefully this article will help. It offers an overview of StreamInsight from a non-programmer's point of view, with the goal of helping you evaluate what StreamInsight is and what it can do.
Streams of Data
Just as Microsoft SQL Server is designed to manage static data, StreamInsight is designed to analyze dynamic data. To StreamInsight, a stream is a sequence of data that has time associated with it. Examples would be a stock ticker stream that provides the prices of different stocks in an exchange as they change over time, or a temperature sensor stream that provides temperature values reported by the sensor over time.
A StreamInsight program passes the stream through a set of queries that analyze the data, watching for interesting information. It then outputs information derived from the queries, such as an alert that was generated because a query identified an anomaly.
You can read a more detailed description of the components of StreamInsight in StreamInsight Concepts.
Queries
Queries are the heart of StreamInsight. You can define one or more queries that pick through the data of a moving stream, looking for interesting values or patterns. For example, you might define a query over a stock ticker stream that watches for large and rapid fluctuations in a particular stock price.
The query language used in StreamInsight is a variation of LINQ (Language-Integrated Query). Similar to the way T-SQL allows you to query SQL databases, StreamInsight LINQ allows you to query streaming data.
For example, in SQL Server you might have a database table named "Tollbooth" that contains information on cars that passed through a tollbooth over the past week. Using T-SQL, you could select the cars that had Washington state license plates using a query like this:
select *
from Tollbooth
where LicenseState = 'WA'
Now suppose that instead of a static database of historical tollbooth readings you had live data coming from the tollbooth in real time. In this case you could use StreamInsight to watch for any license plate from Washington and identify it immediately. You could find these cars with a very similar-looking LINQ query:
var myQuery = from car in tollStream
              where car.LicenseState == "WA"
              select car;
However, because a data stream also includes a time component, StreamInsight LINQ gives you the additional ability to compose time-related queries. For example, suppose you wanted to count the number of cars that pass through the tollbooth in a 3-minute window, updating that count every minute:
var countStream = from win in tollStream.HoppingWindow(TimeSpan.FromMinutes(3),
TimeSpan.FromMinutes(1))
select win.Count();
This query uses a "hopping window" - a window 3 minutes long that "hops" forward every minute. Here's a picture of how this works:
In the example figure, you can see that there were 3 cars in the first 3-minute window, 2 in the next, and so on. Not only would a query like this be much more complex in T-SQL, but, unlike a T-SQL query over stored data, this count is performed in real time as the cars pass through the tollbooth.
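If it helps to see the counting spelled out, here is a minimal sketch of the hopping-window idea in plain Python. It is not StreamInsight code and does not use the StreamInsight engine. The function name, the arrival times, and the choice to start the first window at time 0 are all assumptions made for illustration.

```python
def hopping_window_counts(event_times, window_size, hop_size):
    """Count events in each window [start, start + window_size).

    Window starts are spaced hop_size apart, beginning at time 0
    (a simplification for this sketch). Empty windows are skipped.
    """
    if not event_times:
        return []
    results = []
    last_event = max(event_times)
    start = 0
    while start <= last_event:
        end = start + window_size
        count = sum(1 for t in event_times if start <= t < end)
        if count:
            results.append((start, end, count))
        start += hop_size
    return results


# Hypothetical car arrival times, in seconds since the stream started.
car_arrivals = [20, 50, 100, 200, 250]

# A 3-minute (180 s) window that hops forward every minute (60 s).
counts = hopping_window_counts(car_arrivals, window_size=180, hop_size=60)
for start, end, count in counts:
    print(f"window {start:>3}s to {end:>3}s: {count} car(s)")
```

Notice that each car is counted in more than one window, because consecutive windows overlap by two minutes. This sketch looks at a finished list of arrivals, whereas StreamInsight produces the same kind of result while the cars are still arriving.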
Here is the crucial difference between StreamInsight and a database: StreamInsight never stores the data anywhere! Instead, the query is kept active at all times, and every new reading (we call them “events”) will immediately, upon arrival, trigger a new computation and generate a new result. The StreamInsight computation engine keeps only as much data in memory as necessary. In the example above, the counter is kept for the duration of the 3-minute window. Every new incoming event causes an incremental update of StreamInsight’s internal data structures, and, if the query semantics require it, an output event. So even though StreamInsight LINQ queries look very much like SQL queries, their execution (once deployed to the engine) is very different: it is event-driven and continuous.
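The event-driven idea can be sketched in a few lines of plain Python. This is only an illustration of the principle and says nothing about how the StreamInsight engine is actually built. The class name and the arrival times are hypothetical, and for simplicity the sketch counts the cars seen in the 3 minutes before each new arrival rather than using hopping windows.

```python
from collections import deque


class RollingCarCounter:
    """Keeps only the arrival times that still fall inside the window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.recent = deque()  # arrival times still inside the window

    def on_event(self, timestamp):
        """Handle one new arrival and return the updated count."""
        self.recent.append(timestamp)
        # Forget arrivals that are now too old to matter.
        cutoff = timestamp - self.window_seconds
        while self.recent and self.recent[0] <= cutoff:
            self.recent.popleft()
        return len(self.recent)


counter = RollingCarCounter(window_seconds=180)
outputs = []
for arrival in [20, 50, 100, 200, 250]:  # hypothetical times, in seconds
    outputs.append(counter.on_event(arrival))
print(outputs)
```

Each arrival triggers one small update and one new result, and nothing older than the window is kept. Nothing is written to a table and no query is re-run over historical data.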
An excellent tutorial on queries is A Hitchhiker's Guide to StreamInsight Queries. The technical details of StreamInsight LINQ can be found in the MSDN documentation under Using StreamInsight LINQ.
Moving Data In and Out
A StreamInsight program brings incoming data to the queries using a data source, and the output of the queries is delivered through a data sink. A programmer creates these sources and sinks using functions available in StreamInsight.
Data stream sources can receive data from manufacturing applications, financial trading applications, Web analytics, operational analytics, etc. - even a static database can be used as a source to provide reference information. Data stream sinks could be pagers, monitoring devices, KPI dashboards, trading stations, or databases – any device or system that needs to use what the query has found.
A StreamInsight program can have multiple data sources, and queries can combine these sources using standard operations such as joins and unions. Similarly, a StreamInsight program can have multiple data sinks, accepting data from one or more queries.
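To make the source, query, and sink roles concrete, here is a minimal sketch in plain Python. It does not use the StreamInsight API. The two tollbooth sources, their readings, and every function and class name are invented for illustration. It combines two sources with a union, applies the Washington filter from earlier, and hands the matches to a sink.

```python
import heapq


def booth_source(booth_name, readings):
    """A data source: yields one event per (time, state) reading."""
    for timestamp, license_state in readings:
        yield {"time": timestamp, "booth": booth_name,
               "LicenseState": license_state}


def union_by_time(*sources):
    """Union of several time-ordered sources into one time-ordered stream."""
    return heapq.merge(*sources, key=lambda event: event["time"])


def washington_cars(stream):
    """The query: pass along only cars with Washington plates."""
    return (event for event in stream if event["LicenseState"] == "WA")


class ListSink:
    """A data sink: here it just remembers what it was given."""

    def __init__(self):
        self.received = []

    def deliver(self, event):
        self.received.append(event)


# Two hypothetical tollbooths, each reporting (minute, license state).
north = booth_source("north", [(1, "WA"), (4, "OR"), (6, "WA")])
south = booth_source("south", [(2, "CA"), (3, "WA")])

sink = ListSink()
for event in washington_cars(union_by_time(north, south)):
    sink.deliver(event)

for event in sink.received:
    print(event["time"], event["booth"], event["LicenseState"])
```

In a real deployment the sink might be a dashboard or a database rather than a list, but the shape is the same: sources feed queries, and queries feed sinks.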
Clients and Servers
StreamInsight is based on a client-server architecture. The client program defines the data source, the queries, and the data sink, and copies these definitions to the server. The server then does the ongoing work of running the queries over live data.
The advantage of this system is that the server that actually does the real-time work can be running on a dedicated system designed for that type of computational load. Also, you can have multiple clients copying data source, query, and data sink definitions to the same server, and then these components can be shared on the server. For example, multiple queries from different clients could be performing different analyses over data from a single shared source.
Another advantage is still to come: Microsoft is developing a StreamInsight server that will run in Windows Azure. This means that you will be able to choose to have the ongoing work of data movement and analysis performed in the cloud. You can follow the progress of this project (currently codenamed Austin) on the StreamInsight Blog - watch for posts such as this announcement from a few months ago: StreamInsight in Windows Azure: Austin February CTP.
If you want more detail on how this client-server architecture works, you can read Deploying StreamInsight Entities to a StreamInsight Server (you can just ignore the code examples).
Want More Information?
If you want to dig a little further and get more details on StreamInsight, you might start with Get Started with Streaminsight 2.1. Included in that article are links to some case studies of companies using StreamInsight, as well as a good training course.
You can also post questions or comments here or in the StreamInsight Forum.