Sdílet prostřednictvím


Intro to C# and Analyzing Government Data

Hello World!

I have recently been informed that many of my articles may be a bit advanced for folks, so I am going to kick off a series of C# articles dedicated to the Beginner to programming.  I have no idea how long this series is going to be, I’ll just keep adding to it as requests come in for various topics.  This series is meant to take the absolute beginner to a level in which they can possibly derive value from my other articles. 

 This is a pull from my personal blog at: www.indiedevspot.com

Tools

Lets start off with some tools.  For the purposes of this article, there is really only one tool that you need.  Visual Studio.  As of today Visual Studio 2015 is currently in preview, so in the heart of all that is beta, we will start with that immediately.  I imagine Visual Studio 2015 Community to be coming to general availability soon, but no idea when.

The download for Visual Studio is here: https://www.visualstudio.com/en-us/downloads/download-visual-studio-vs.aspx

References

These are the places that I go when I need help figuring something out, but also places you can go to learn.

  1. https://www.microsoftvirtualacademy.com/training-courses/c-fundamentals-for-absolute-beginners
  2. https://stackoverflow.com/
  3. https://msdn.microsoft.com/library
  4. https://github.com/

Terminology

These are some common terms you will hear in the development world.  I want to address some of these right away.

  • Full Stack Developer:   A developer who can code from database to web services to client on a single tech stack
  • Tech Stack: Usually this is .Net, Node, Open Source.  Its the series of technologies that you use to accomplish data storage, web services and UI
  • .Net:  Series of development languages and technologies developed by Microsoft.  C# is a language part of the .Net family, as is F# and VB.  They run in the “CLR”, which is the common language run time or code execution environment these languages run in, which is common across them.
  • Data Storage: This is typically in reference to long term data storage or databases, such as SQL.
  • SQL: A relational database developed by Microsoft for storing tabular data with relationships between the tables.
  • Web Services:  Endpoints on the web that can be accessed that complete tasks and return results.  Typically Restful.
  • Endpoint:  Typically a globally accessible or resolvable address on the internet, such as https://www.mywebsite.com/api/Cars
  • Restful:  Type of internet calls for endpoints.  These are typically Get, Post, Delete.  A series of known verbs control this.  It is typically manifested in a url as: https://www.mywebsite.com/api/Cars/Black/1 (would get a black car with an id of 1 if following normal conventions).
  • Object Oriented: A paradigm of coding in which objects are the primary feature of the language.
  • Object: A section of code that contains data as well as methods to manipulate data.  Manifested as pieces of memory that can be instantiated in your code.  Objects are common in structure, but differ in mutable types.
  • Mutable type:  Data that can be changed that lives in memory, typically in the context of an object.
  • Logic:  The intelligence to your code that manipulates data to produce a desired result.

Getting Started

Start by opening Visual Studio 2015.

On the top left, you should see a series of buttons, click File -> New -> Project

Expand Templates, click on Visual C# and select ConsoleApplication.  Name the application as seen below.

NewConsoleApp

Click OK.

You should now have a view like below…

emptyApp

Lets take a minute to talk about this.

I will explain concepts in need to order.

  1. namespace Tutorial: This defines that code within these curly brackets ({ }) will be referenced by the name Tutorial.
  2. using System; : System is a namespace, we are using the namespace System, such that we can have access to objects within it.
  3. class Program: This is the default class built for us, other code will reference this via Tutorial.Program
  4. static void Main(string[] args): this is the “entry point” of the application.  This is where everything starts.

Now lets start with the super simple “Hello World” before we jump into something way more interesting.

1 2 3 4 5 6 7 8 9 10 11 namespace Tutorial {     public class Program     {         static void Main(string[] args)         {             Console.WriteLine("Hello World!");             Console.ReadKey();         }     } }

Change your code to reflect above (keep your using statements).  This simply will write to the console “Hello World!” and then wait for you to push any key before moving on.

To run the program, on the very top bar in Visual Studio is a “Debug Button”.  Click this to execute your code. It is identifiable by being a green Arrow |> pointing to the right with Start next to it.

debugbutton

You will see a console pop up and state “Hello World”, you can push a button and it will go away.

Doing something Real now.

So now lets do something interesting, like figure out the best and worst place to live in the United States based on the open data on data.gov :)

Download the data from here: https://onedrive.live.com/redir?resid=478A64BC5718DA47\!300\&authkey=\!AP7WdDyZ\_Inbfas\&ithint=file%2cxls

This is real data pulled from data.gov.

First, we need to bring in a library that helps us parse Excel files.

Open Solution Explorer

On the right hand side of the screen is Solution Explorer, if you don’t have it there, it can be found at the top button bar View -> Solution Explorer.

Add the Excel Nuget Package.  Right Click the .csproj file or Tutorial, highlighted in blue here

project File

After right clicking Tutorial, a submenu will pop up, click “Manage Nuget Packages”.  A new dialogue will fill the screen.  In the search text box, type “Excel”.

Click on ExcelDataReader and click Install.

ExcelDataReaderNuget

Close that tab.

Now we are ready to do something useful!

Lets start by adding new using statements that we need to the top of the file:

1 2 3 4 5 using Excel; using System; using System.Collections.Generic; using System.Data; using System.IO;

Lets build a few types…

Within the Tutorial namespace, add the following code:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 //Enum Type //These are actually integer values, but can be named. //Useful for switch statements. public enum Stats {     Employed,     Unemployed,     MedianIncome,     LaborForce, } /// <summary> /// Data Class that we can use for holding data. /// </summary> public class AreaStatistic {     /// <summary>     /// Data property of type string, holds the state this area is in.     /// </summary>     public string State { get; set; }     /// <summary>     /// Name of the area within the state.     /// </summary>     public string AreaName { get; set; }     /// <summary>     /// List of statistics for each year of this particular area.     /// It is a List of a Tuple, which is a pair of objects, in which the first     /// object is an integer and the second object is a tuple.     /// The second Tuple is a pair of objects in which the first is a Stat enum and the second is     /// a float?  a float? is a nullable floating point number     /// </summary>     public List<Tuple<int, Tuple<Stats, float?>>> YearlyStats { get; set; }   }

This code defines the data types we will be using for our analysis.  The AreaStatistic defines a class, which, when instantiated is an object.

Lets now write out our code for getting the statistics for a year.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 /// <summary> /// Gets stats for a particular year. /// </summary> /// <param name="r">Data Row you want to get stats for for that year</param> /// <param name="year">Year you want to have data for</param> /// <param name="i">index starting point for that year.</param> /// <returns></returns> public static List<Tuple<int, Tuple<Stats, float?>>> GetStatsForYear(DataRow r, int year, int i) {     //Create the empty object which holds the stats.     List<Tuple<int, Tuple<Stats, float?>>> stats = new List<Tuple<int, Tuple<Stats, float?>>>();     //Begin big region for making sure we deal with empty or null data.     string s = r.ItemArray[i].ToString();     string s1 = r.ItemArray[i + 1].ToString();     string s2 = r.ItemArray[i + 2].ToString();     float? f;     float? f1;     float? f2;     if (string.IsNullOrEmpty(s) || string.IsNullOrWhiteSpace(s))     {         f = null;     }     else     {         f = float.Parse(s);     }     if (string.IsNullOrEmpty(s1) || string.IsNullOrWhiteSpace(s1))     {         f1 = null;     }     else     {         f1 = float.Parse(s1);     }     if (string.IsNullOrEmpty(s2) || string.IsNullOrWhiteSpace(s2))     {         f2 = null;     }     else     {         f2 = float.Parse(s2);     }     //End big area of checking for bad data.       //get data for LaborForce this year     stats.Add(         new Tuple<int, Tuple<Stats, float?>>         (year, new Tuple<Stats, float?>         (Stats.LaborForce, f)));     //get data for # employed this year.     stats.Add(         new Tuple<int, Tuple<Stats, float?>>         (year, new Tuple<Stats, float?>         (Stats.Employed, f1)));     //get data for # unemployed this year.     stats.Add(         new Tuple<int, Tuple<Stats, float?>>         (year, new Tuple<Stats, float?>         (Stats.Unemployed, f2)));     //return the yearly stats.     return stats; }

Wow!  That is a TON of code!  Most of it is error checking, man I am missing F# pattern matching right now…The rest simply gets sections out of the row and adds it to a stats type.  So what calls this code?

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 /// <summary> /// Converts a single row into the AreaStatistic Type. /// Note, this is based on the format of the file being known. /// </summary> /// <param name="r">Data Row</param> /// <returns>AreaStatistic</returns> public static AreaStatistic ConvertRowToStat(DataRow r) {     //create an empty stat object we can populate     AreaStatistic stat = new AreaStatistic();     //Get the state code from the data row.     stat.State = r.ItemArray[1].ToString();     //if the state code is not of length 2, it is not a data row.     if(stat.State.Length != 2)     {         //Throw an exception so the code stops here and doesn't continue.         //An exception will be printed to the console due to the line of code that         //wraps this.         throw new Exception("Not data row");     }     //get the area name.     stat.AreaName = r.ItemArray[2].ToString();     //Build an empty object for the yearly stats.     stat.YearlyStats = new List<Tuple<int, Tuple<Stats, float?>>>();     //Stat years are 2000-2013.  The columns start at column 9,     //There are three rows of data and 1 skipped row.     //Therefor we do 9 (for year 2000 and add 4 * the current iteration to that)     for (int i = 0; i < 14; i++)     {         //Add to the list of stats the stats for this year, total of 13 stats.         stat.YearlyStats.AddRange(Program.GetStatsForYear(r, 2000 + i, 9 + (i * 4)));     }     //row 65 has income data.     string s = r.ItemArray[65].ToString();     float? f;     //sometimes data is blank, we need to deal with that.     if (string.IsNullOrEmpty(s) || string.IsNullOrWhiteSpace(s))     {         //representation of no data.         f = null;     }     else     {         //we should be good to parse this.         f = float.Parse(s);     }     //add the data.     stat.YearlyStats.Add(         new Tuple<int, Tuple<Stats, float?>>         (2013, new Tuple<Stats, float?>         (Stats.MedianIncome, f)));     //Return the stats for this area.     return stat; }

Wow, a bunch more code!  Again, notice that 99% of this code is dealing with poorly formed data and also comments.  Notice the amount of code re-use we get out of the GetStatsForYear method.  This is great, always try to extract methods out like this that you can use.  So Finally, we get into the reading of the data in.  This also has been extracted out, so readers can simply download the .xls file, put it where they want, and point the code to where it is on your laptop.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 /// <summary> /// Method we define that allows our Main function to look cleaner.  This reads in the data and returns /// a list of statistics /// </summary> /// <param name="url">location on our drive where the file is.</param> /// <returns>List of statistics for each area.</returns> public static List<AreaStatistic> ReadInData(string url) {     //Initialize the stats object as a new list that holds objects of type AreaStatistics     List<AreaStatistic> stats = new List<AreaStatistic>();     //Using ensures that the reader is disposed of from memory when we are done with it.     using (StreamReader reader = new StreamReader(url))     {         //This is from the ExcelData library we brought in, that parses the excel data for us.         IExcelDataReader excelReader = ExcelReaderFactory.CreateBinaryReader(reader.BaseStream);         //Converts the stream into a dataset type.         DataSet d = excelReader.AsDataSet();         //There is only 1 table, but we iterate as if there were multiples anyways.         foreach (DataTable dt in d.Tables)         {             //Iterate each row in the table.             foreach (DataRow r in dt.Rows)             {                 try                 {                     //add to our stats list the statistic after we convert it.                     stats.Add(Program.ConvertRowToStat(r));                 }                 catch(Exception e)                 {                     //if something goes wrong, write it to the console.                     //tell me what happened.                     Console.WriteLine(e.Message);                 }             }         }     }     //return the list of stats     return stats; }

This doesn’t seem so bad.  You can see the catching of the exception, this is where if any transient missed exceptions get thrown will be caught such that our program won’t just up and crash on us.  The downside is that if an exception is thrown due to bad data, that row will not be added to our stats, and instead the console will write out the message of what happened.

Finally, we take a look at our code entry point:

1 2 3 4 5 6 7 8 9 10 11 /// <summary> /// Entry point to the program. /// </summary> /// <param name="args">Command line arguments passed in.</param> static void Main(string[] args) {     //We populate the type "stats" with data we read from the .xls file.     List<AreaStatistic> stats = Program.ReadInData("C:\\Unemployment.xls");     //just lets us see some output.     Console.ReadKey(); }

From here, you can see our high level task, we wanted to read in the statistics into this “stats” object.  We now have a list of “stats” that we can do some real analysis with :).

The next article will focus on doing interesting things with this data for more insights into it.  Maybe we can see if we can verify the claims about unemployment rates.  Maybe there are interesting insights from this data we can see from a collection standpoint, or growth rates in total labor vs employed vs unemployed vs income.

Please use the comments section for any questions you might have.  I know this might be a bit more advanced than I intended.  The full code is pasted below…

Summary

We have covered some C# basics, as well as our first dive into analyzing government data.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 using Excel; using System; using System.Collections.Generic; using System.Data; using System.IO;   namespace Tutorial {     //Enum Type     //These are actually integer values, but can be named.     //Useful for switch statements.     public enum Stats     {         Employed,         Unemployed,         MedianIncome,         LaborForce,     }     /// <summary>     /// Data Class that we can use for holding data.     /// </summary>     public class AreaStatistic     {         /// <summary>         /// Data property of type string, holds the state this area is in.         /// </summary>         public string State { get; set; }         /// <summary>         /// Name of the area within the state.         /// </summary>         public string AreaName { get; set; }         /// <summary>         /// List of statistics for each year of this particular area.         /// It is a List of a Tuple, which is a pair of objects, in which the first         /// object is an integer and the second object is a tuple.         /// The second Tuple is a pair of objects in which the first is a Stat enum and the second is         /// a float?  a float? is a nullable floating point number         /// </summary>         public List<Tuple<int, Tuple<Stats, float?>>> YearlyStats { get; set; }       }     /// <summary>     /// Primary program that lets us do stuff.     /// </summary>     public class Program     {         /// <summary>         /// Entry point to the program.         /// </summary>         /// <param name="args">Command line arguments passed in.</param>         static void Main(string[] args)         {             //We populate the type "stats" with data we read from the .xls file.             List<AreaStatistic> stats = Program.ReadInData("C:\\Unemployment.xls");             //just lets us see some output.             Console.ReadKey();         }         /// <summary>         /// Method we define that allows our Main function to look cleaner.  This reads in the data and returns         /// a list of statistics         /// </summary>         /// <param name="url">location on our drive where the file is.</param>         /// <returns>List of statistics for each area.</returns>         public static List<AreaStatistic> ReadInData(string url)         {             //Initialize the stats object as a new list that holds objects of type AreaStatistics             List<AreaStatistic> stats = new List<AreaStatistic>();             //Using ensures that the reader is disposed of from memory when we are done with it.             using (StreamReader reader = new StreamReader(url))             {                 //This is from the ExcelData library we brought in, that parses the excel data for us.                 IExcelDataReader excelReader = ExcelReaderFactory.CreateBinaryReader(reader.BaseStream);                 //Converts the stream into a dataset type.                 DataSet d = excelReader.AsDataSet();                 //There is only 1 table, but we iterate as if there were multiples anyways.                 foreach (DataTable dt in d.Tables)                 {                     //Iterate each row in the table.                     foreach (DataRow r in dt.Rows)                     {                         try                         {                             //add to our stats list the statistic after we convert it.                             stats.Add(Program.ConvertRowToStat(r));                         }                         catch(Exception e)                         {                             //if something goes wrong, write it to the console.                             //tell me what happened.                             Console.WriteLine(e.Message);                         }                     }                 }             }             //return the list of stats             return stats;         }           /// <summary>         /// Converts a single row into the AreaStatistic Type.         /// Note, this is based on the format of the file being known.         /// </summary>         /// <param name="r">Data Row</param>         /// <returns>AreaStatistic</returns>         public static AreaStatistic ConvertRowToStat(DataRow r)         {             //create an empty stat object we can populate             AreaStatistic stat = new AreaStatistic();             //Get the state code from the data row.             stat.State = r.ItemArray[1].ToString();             //if the state code is not of length 2, it is not a data row.             if(stat.State.Length != 2)             {                 //Throw an exception so the code stops here and doesn't continue.                 //An exception will be printed to the console due to the line of code that                 //wraps this.                 throw new Exception("Not data row");             }             //get the area name.             stat.AreaName = r.ItemArray[2].ToString();             //Build an empty object for the yearly stats.             stat.YearlyStats = new List<Tuple<int, Tuple<Stats, float?>>>();             //Stat years are 2000-2013.  The columns start at column 9,             //There are three rows of data and 1 skipped row.             //Therefor we do 9 (for year 2000 and add 4 * the current iteration to that)             for (int i = 0; i < 14; i++)             {                 //Add to the list of stats the stats for this year, total of 13 stats.                 stat.YearlyStats.AddRange(Program.GetStatsForYear(r, 2000 + i, 9 + (i * 4)));             }             //row 65 has income data.             string s = r.ItemArray[65].ToString();             float? f;             //sometimes data is blank, we need to deal with that.             if (string.IsNullOrEmpty(s) || string.IsNullOrWhiteSpace(s))             {                 //representation of no data.                 f = null;             }             else             {                 //we should be good to parse this.                 f = float.Parse(s);             }             //add the data.             stat.YearlyStats.Add(                 new Tuple<int, Tuple<Stats, float?>>                 (2013, new Tuple<Stats, float?>                 (Stats.MedianIncome, f)));             //Return the stats for this area.             return stat;         }         /// <summary>         /// Gets stats for a particular year.         /// </summary>         /// <param name="r">Data Row you want to get stats for for that year</param>         /// <param name="year">Year you want to have data for</param>         /// <param name="i">index starting point for that year.</param>         /// <returns></returns>         public static List<Tuple<int, Tuple<Stats, float?>>> GetStatsForYear(DataRow r, int year, int i)         {             //Create the empty object which holds the stats.             List<Tuple<int, Tuple<Stats, float?>>> stats = new List<Tuple<int, Tuple<Stats, float?>>>();             //Begin big region for making sure we deal with empty or null data.             string s = r.ItemArray[i].ToString();             string s1 = r.ItemArray[i + 1].ToString();             string s2 = r.ItemArray[i + 2].ToString();             float? f;             float? f1;             float? f2;             if (string.IsNullOrEmpty(s) || string.IsNullOrWhiteSpace(s))             {                 f = null;             }             else             {                 f = float.Parse(s);             }             if (string.IsNullOrEmpty(s1) || string.IsNullOrWhiteSpace(s1))             {                 f1 = null;             }             else             {                 f1 = float.Parse(s1);             }             if (string.IsNullOrEmpty(s2) || string.IsNullOrWhiteSpace(s2))             {                 f2 = null;             }             else             {                 f2 = float.Parse(s2);             }             //End big area of checking for bad data.               //get data for LaborForce this year             stats.Add(                 new Tuple<int, Tuple<Stats, float?>>                 (year, new Tuple<Stats, float?>                 (Stats.LaborForce, f)));             //get data for # employed this year.             stats.Add(                 new Tuple<int, Tuple<Stats, float?>>                 (year, new Tuple<Stats, float?>                 (Stats.Employed, f1)));             //get data for # unemployed this year.             stats.Add(                 new Tuple<int, Tuple<Stats, float?>>                 (year, new Tuple<Stats, float?>                 (Stats.Unemployed, f2)));             //return the yearly stats.             return stats;         }         } }

Comments

  • Anonymous
    April 21, 2015
    Some nice dirty data there. Can't help but agree with you're comment "man I am missing F#" :) and couldn't help thinking this would be a lot easier in F#, so here it is: http://fssnip.net/qE

  • Anonymous
    April 22, 2015
    I love it!  Isn't that so much more clean and awesome?