Creating a “Real World” Database in Azure for Advanced Analytics Exploration: Part 2

In the previous article in this series, I wrote about provisioning the database in Azure and performing the download of the data from the City of Chicago crime statistics.

In this post, I will discuss transforming the initial data into the various tables necessary to perform meaningful analysis. Keep in mind that the steps I'm laying out here are not the only way to accomplish the task at hand; I'm simply providing a relatively simple and straightforward approach to creating data structures that can later be used and consumed by the other tools that will be discussed in later posts in this series.

If you haven't performed the steps in the previous article, you'll want to do that first to have the basic environment available in order to follow the steps laid out here.

Creating the Basic Table Structure

Since the end-state of this database is to provide a structure for efficient queries and analysis, we'll stick with a simple denormalized design. We will also create some tables that are purpose-built for specific queries, so there will be duplicated data within the database. This should by no means be seen as a best practice, but rather as a way to build data structures that support specific use cases.

With all of the above said, the database structure will look like this:

This design is primarily focused on providing efficient query operations and there are likely many different opinions on how the data could be better structured. Again, for the purpose of this series of blog posts we're not going to focus on best practices, but rather on getting the job done.

This database design is known in the data warehouse world as a star schema. The schema consists of a central fact table and surrounding dimension tables. In short, the fact table contains the information we're interested in, and the dimension tables contain the descriptors for that information.

Creating the Fact Table (ChicagoCrimes)

The heart of the database will be the ChicagoCrimes fact table. For the purposes of this blog post, we're going to create a table that uses the same column names as the original data that we downloaded. This means creating columns that have spaces in their names, something to which I am generally averse; however, it makes it easier to understand how all the pieces fit together in the final puzzle. To create this table, connect to your database instance in Azure using SQL Server Management Studio (see the instructions in the first post if you don't know how to do this), change to the database you created, and execute the following TSQL script:

CREATE TABLE [dbo].[ChicagoCrimes]
(
    [ID] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Case Number] [varchar](50) NOT NULL,
    [Date] [datetime] NOT NULL,
    [Block] [varchar](50) NULL,
    [IUCR] [varchar](10) NULL,
    [Primary Type] [varchar](50) NULL,
    [Description] [nvarchar](max) NULL,
    [Location Description] [varchar](50) NULL,
    [Arrest] [varchar](10) NULL,
    [Domestic] [varchar](10) NULL,
    [Beat] [varchar](10) NULL,
    [District] [int] NULL,
    [Ward] [varchar](10) NULL,
    [Community Area] [int] NULL,
    [FBI Code] [varchar](10) NULL,
    [X Coordinate] [int] NULL,
    [Y Coordinate] [int] NULL,
    [Year] [int] NULL,
    [Updated On] [datetime] NULL,
    [Latitude] [float] NULL,
    [Longitude] [float] NULL,
    [Location] [varchar](50) NULL,
    [DateOnly] [date] NULL
);

This should execute very quickly. The script above creates the ChicagoCrimes table, marking the ID column as the primary key, which also creates a clustered index on the ID column.
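If you want to confirm that the primary key and its clustered index were created as expected, a quick optional check against the system catalog (not part of the original steps) looks like this:

-- Optional sanity check: confirm the clustered primary key on ChicagoCrimes
SELECT name, type_desc, is_primary_key
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.ChicagoCrimes');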

Create the Remaining Tables

The next step is to create the remaining tables. Execute the following TSQL script:

CREATE TABLE [dbo].[Community]
(
    [Community_id] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Community] [varchar](30) NULL,
    [Area] [varchar](20) NULL,
    [Side] [varchar](15) NULL,
    [Communityad] [varchar](50) NULL,
    [WeatherStation] [varchar](4) NULL
);

CREATE TABLE [dbo].[DateDimension]
(
    [PKIDDate] [varchar](50) NOT NULL PRIMARY KEY CLUSTERED,
    [Date] [date] NOT NULL UNIQUE,
    [Day] [int] NULL,
    [DaySuffix] [varchar](50) NULL,
    [DayOfWeek] [int] NULL,
    [DayOfWeekName] [varchar](50) NULL,
    [DOWInMonth] [int] NULL,
    [DayOfYear] [int] NULL,
    [WeekOfYear] [int] NULL,
    [WeekOfMonth] [int] NULL,
    [Month] [int] NULL,
    [MonthName] [varchar](50) NULL,
    [Quarter] [int] NULL,
    [QuarterName] [varchar](50) NULL,
    [Year] [int] NULL,
    [StandardDate] [date] NULL,
    [HolidayText] [varchar](50) NULL,
    [Season] [varchar](50) NULL
);

CREATE TABLE [dbo].[PoliceStations]
(
    [District] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Address] [varchar](255) NULL,
    [City] [varchar](50) NULL,
    [State] [varchar](50) NULL,
    [Zip] [varchar](50) NULL,
    [Website] [varchar](500) NULL,
    [Phone] [varchar](50) NULL,
    [Fax] [varchar](50) NULL,
    [TTY] [varchar](50) NULL,
    [Location] [varchar](255) NULL
);

CREATE TABLE [dbo].[SocioEconomicFactors]
(
    [Community Area Number] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [Community Area Name] [varchar](50) NULL,
    [Percent of Housing Crowded] [float] NULL,
    [Percent Households Below Poverty] [float] NULL,
    [Percent Aged 16+ Unemployed] [float] NULL,
    [Percent Aged 25+ Without High School Diploma] [float] NULL,
    [Percent Aged Under 18 or Over 64] [float] NULL,
    [Per Capita Income] [float] NULL,
    [Hardship Index] [int] NULL
);

The above script will create each of the primary tables needed in the database. The next step is to create the constraints that will ensure referential integrity between the tables.

Create the Foreign Key Constraints

This step is not strictly necessary in order to have a functional database, but it provides a framework that ensures the data stored in the fact table is properly related to the data in the dimension tables. While we're not aiming for a perfect model of this particular database, we will be using it for analysis, so it's important to do what we can to ensure that the data we analyze is at least consistent with the model we build.

To create the Foreign Key constraints, execute the following TSQL script:

ALTER TABLE ChicagoCrimes ADD
    CONSTRAINT FK_Community FOREIGN KEY ([Community Area]) REFERENCES dbo.Community (Community_id),
    CONSTRAINT FK_DateDim FOREIGN KEY ([DateOnly]) REFERENCES dbo.DateDimension ([Date]),
    CONSTRAINT FK_PoliceStation FOREIGN KEY (District) REFERENCES dbo.PoliceStations (District),
    CONSTRAINT FK_Socio FOREIGN KEY ([Community Area]) REFERENCES dbo.SocioEconomicFactors ([Community Area Number]);

Once the constraints are added, the dimension tables can be populated.
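As an optional sanity check (not part of the original steps), you can list the foreign keys that now hang off the fact table:

-- Optional: list the foreign keys defined on the ChicagoCrimes fact table
SELECT fk.name AS ConstraintName,
       OBJECT_NAME(fk.referenced_object_id) AS DimensionTable
FROM sys.foreign_keys AS fk
WHERE fk.parent_object_id = OBJECT_ID('dbo.ChicagoCrimes');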

Populating the Dimension Tables

In a star schema, the dimension tables provide descriptive data about related information in the fact table. The dimension tables will need to be populated before the fact table due to the foreign key constraints that were just added.

Populating the Date Dimension

Arguably the most important dimension of any star schema is the date dimension. The more descriptive information that can be provided about a date in the dimension table, the more interesting your analysis can be. For simplicity's sake, I will provide an export of the date dimension that I use most often, which has been populated with many descriptive columns and with seasons for the northern hemisphere. You can download the .csv file here. (Save the file to your local drive.)

Once the file is downloaded, start the SQL Server Import Export Wizard (refer back to the first article in this series if you need a refresher on how to start the wizard)

Click Next and choose Flat File Source from the drop down menu. Select the file that you just downloaded in the File name textbox. Uncheck the Column Names in First Data Row checkbox.

Click Advanced on the left side and change the Data Type to Unicode String [DT_WSTR] on columns 0, 3, 5, 11, 13, 16, and 17 (this ensures the data types are converted properly when loaded into the dimension table).

Click Preview on the left side to verify that the file is being properly read.

Choose Next and select SQL Native Client .. in the Destination drop down, enter your server connection information, and then select the ChicagoCrime database:

Click Next and ensure the DateDimension table is selected in the Destination field:

Click Next and ensure the Convert checkbox is selected for every row that requires a conversion (denoted by a caution icon on the left side of the dialog):

Click Next twice and then click Finish to load the data into the DateDimension table. You should see 54787 rows loaded.
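If you would rather not use the CSV import, here is a minimal sketch that generates a skeleton date dimension in TSQL instead. This is not part of the original walkthrough: the date range and the handful of columns it populates are my own assumptions, and the downloadable file remains the better option because it also carries holidays, seasons, and the other descriptive columns.

-- Hypothetical alternative to the CSV import: generate a skeleton date dimension.
-- Only a few columns are populated, and the 2001-2030 date range is an assumption.
;WITH Dates AS
(
    SELECT CAST('2001-01-01' AS date) AS [Date]
    UNION ALL
    SELECT DATEADD(DAY, 1, [Date]) FROM Dates WHERE [Date] < '2030-12-31'
)
INSERT dbo.DateDimension (PKIDDate, [Date], [Day], [DayOfWeek], [DayOfWeekName],
                          [Month], [MonthName], [Quarter], [Year])
SELECT CONVERT(varchar(8), [Date], 112),  -- yyyymmdd string as the surrogate key
       [Date],
       DAY([Date]),
       DATEPART(WEEKDAY, [Date]),
       DATENAME(WEEKDAY, [Date]),
       MONTH([Date]),
       DATENAME(MONTH, [Date]),
       DATEPART(QUARTER, [Date]),
       YEAR([Date])
FROM Dates
OPTION (MAXRECURSION 0);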

Populating the Community Dimension

The Community dimension table is used to relate the area that the crime report occurred in to a community in Chicago. This table also has geographic breakdowns (North Side, South Side, etc) as well as a relevant weather station (that will be used in later posts in this series). You can download the data file for the Community dimension here (save the file to your local machine).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double-quote (") in the Text qualifier textbox.

Choose Advanced on the left side and change the data type of Column 0 to 4-byte unsigned integer [DT_UI4], and change the remaining column data types to Unicode String [DT_WSTR].

Choose Preview on the left side to verify that the columns are properly recognized

Choose Next and then select SQL Server Native Client xx in the Destination drop down and then enter your server information. Choose ChicagoCrime in the Database drop down.

Click Next and make sure that the Community table is shown in the Destination

Click Next twice and then Finish to populate the table. You should see 78 rows transferred.

Populating the PoliceStations Dimension

The PoliceStations dimension is used to relate the area that a crime occurred to the police precinct that services the area. You can download the data file for the PoliceStations dimension here (Save the file to your local drive).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double-quote (") in the Text qualifier textbox.

Click Advanced on the left side. Change the data type of Column 0 to 4-byte unsigned integer [DT_UI4] and change the remaining column data types to Unicode string [DT_WSTR]. Change the Output column width on columns 5 and 9 to 255.

Click Preview on the left side to ensure that the columns are correctly displayed.

Click Next and select SQL Native Client xx in the Destination drop down. Enter your server connection information and logon credentials, then select the ChicagoCrimes database from the Database dropdown.

Click Next and verify that the PoliceStations table is listed in the Destination.

Click Next twice and then Finish to load the PoliceStations table. You should see 23 rows transferred.

Populating the SocioEconomicFactors Dimension

The SocioEconomicFactors dimension is used to describe the social and economic factors of a given neighborhood in Chicago. This will be useful in later analysis of the crime data.

You can download the data file for the SocioEconomicFactors dimension here (Save the file to your local drive).

Once the file is downloaded, start the SQL Server Import/Export Wizard and choose Flat File Source from the source drop down. Choose the file that you just downloaded, and uncheck the Column names in the first data row checkbox. Put a double-quote (") in the Text qualifier textbox. Change the Code page to 1252 (ANSI – Latin I).

On the left side, select Preview to ensure the columns are properly displayed.

 

Click Next and select SQL Native Client xx in the Destination drop down. Enter your server connection information and logon credentials, then select the ChicagoCrimes database from the Database dropdown.

Click Next and make sure that the SocioEconomicFactors table is listed in the Destination. (You will have to select the SocioEconomicFactors table in the drop down since the name will be incorrect when the list is initially populated)

Click Next twice and then Finish to load the table. You should see 78 rows transferred.

When the transfer is complete, you will have loaded all of the dimension tables and are ready to load the fact table.
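Before moving on, a quick optional check (not part of the original steps) confirms that each dimension table holds the expected number of rows:

-- Optional: verify dimension row counts (expected: 54787, 78, 23, and 78 respectively)
SELECT 'DateDimension' AS TableName, COUNT(*) AS RowCnt FROM dbo.DateDimension
UNION ALL SELECT 'Community', COUNT(*) FROM dbo.Community
UNION ALL SELECT 'PoliceStations', COUNT(*) FROM dbo.PoliceStations
UNION ALL SELECT 'SocioEconomicFactors', COUNT(*) FROM dbo.SocioEconomicFactors;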

Populating the Fact Table

In a star schema design, the fact table contains all of the information that will actually be analyzed. It is the largest table in the entire design (in the case of this database, it will contain more than 5.5 million rows).

The process of populating the fact table involves writing a TSQL query that will read data from the temporary table that was created previously (follow the steps outlined in Part 1 of this series if you do not have the temporary table) and write the data to the ChicagoCrimes table. This is a very simple process and will make use of implicit conversion to convert from the default data types that the temporary table uses to the data types configured in the ChicagoCrimes table.

The first step in populating the ChicagoCrimes table is to connect to the database and open a new query window. Follow the steps in the first article in this series if you are unfamiliar with doing this. Once you are connected to the database, in the new query window execute the following query:

INSERT ChicagoCrimes
SELECT [ID]
    ,[Case Number]
    ,[Date]
    ,[Block]
    ,[IUCR]
    ,[Primary Type]
    ,[Description]
    ,[Location Description]
    ,[Arrest]
    ,[Domestic]
    ,[Beat]
    ,[District]
    ,[Ward]
    ,[Community Area]
    ,[FBI Code]
    ,[X Coordinate]
    ,[Y Coordinate]
    ,[Year]
    ,[Updated On]
    ,[Latitude]
    ,[Longitude]
    ,[Location]
    ,CAST([Date] AS DATE)
FROM [dbo].[Crimes_New]
ORDER BY [Date];

The above query reads the data from the Crimes_New table (which was populated with data from the City of Chicago data portal in the previous article in this series) and populates the new ChicagoCrimes table, relying on implicit conversion for the data types. The ORDER BY clause ensures that the crime data is inserted in sequential order by date. This query will take several minutes to complete, as there should be more than 5.5 million rows of data in the Crimes_New table.
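If you want to confirm that every staged row made it into the fact table, a simple comparison like this works (the exact count depends on when you downloaded the source data):

-- Optional: compare the staging table and fact table row counts
SELECT (SELECT COUNT(*) FROM dbo.Crimes_New)    AS StagedRows,
       (SELECT COUNT(*) FROM dbo.ChicagoCrimes) AS LoadedRows;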

If you loaded the dimension tables properly, you will find that the query completes with no errors and you will end up with a fully-populated database. You can run some queries against the ChicagoCrimes table like:

SELECT
    dd.Season AS [Season]
    ,dd.DayOfWeekName AS [DayOfWeek]
    ,cc.[DateOnly] AS [IncidentDate]
    ,CAST(cc.[Date] AS TIME) AS [IncidentTime]
    ,cc.[Case Number] AS CaseNo
    ,cc.[Primary Type] AS CrimeType
    ,cc.Description AS [CrimeDescription]
    ,cc.[Location Description] AS [CrimeLocation]
    ,cc.Latitude AS [Latitude]
    ,cc.Longitude AS [Longitude]
    ,cc.Beat AS [Beat]
    ,c.Community AS [Community]
    ,c.Area AS [Area]
    ,c.Side AS [Side]
FROM ChicagoCrimes cc
JOIN [DateDimension] dd
    ON cc.DateOnly = dd.[Date]
JOIN [dbo].[Community] c
    ON cc.[Community Area] = c.Community_id
WHERE cc.[Date] > (GETDATE()-30);

This query returns a list of crimes reported within the last 30 days (assuming you have downloaded the latest data from the City of Chicago within the last 30 days).
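As a further illustration of what the dimension tables buy you (this query is not part of the original walkthrough), you can relate crime volume to the hardship index from the SocioEconomicFactors dimension:

-- Example: crime volume by community, alongside each community's hardship index
SELECT sef.[Community Area Name],
       sef.[Hardship Index],
       COUNT(*) AS IncidentCount
FROM ChicagoCrimes cc
JOIN dbo.SocioEconomicFactors sef
    ON cc.[Community Area] = sef.[Community Area Number]
GROUP BY sef.[Community Area Name], sef.[Hardship Index]
ORDER BY IncidentCount DESC;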

Conclusion

In this post we continued the process of building a usable database for analysis by implementing a star-schema design and populating both the dimension and fact tables. We then executed a query to generate a Crime Report for crimes within the last 30 days in order to test the data.

In future posts I will detail the creation of indexes and views that will help make analysis of this data easier, and will also discuss tools such as Excel and Power BI for analysis.

Comments

  • Anonymous
    June 08, 2015
    The comment has been removed
  • Anonymous
    June 08, 2015
    Hi CJ -- There is an interesting fact in working with Azure SQL DB and using the lower performance tiers in that there is a limited amount of TempDB and Transaction Log space that can be used. If you chose the lower DTU options (as I suggested in my first post) then you possibly can run up against the throttling of the Transaction Log when running the script to populate the fact table. Specifically if you look at this article: msdn.microsoft.com/.../dn338081.aspx you'll note that the error message points to the fact that the transaction log is too large for the tier of service you selected. You can get around this error in a couple of different ways; The easiest is to increase the Scale of the database for the duration of the load (use the Azure Management Portal, select your database and then select the Scale tab, then change the scale) or you can add the WITH (TABLOCK) option to the INSERT command like this: INSERT ChicagoCrimes WITH(TABLOCK) SELECT [ID] ,[Case Number] ,[Date] ,[Block] ,[IUCR] ,[Primary Type] ,[Description] ,[Location Description] ,[Arrest] ,[Domestic] ,[Beat] ,[District] ,[Ward] ,[Community Area] ,[FBI Code] ,[X Coordinate] ,[Y Coordinate] ,[Year] ,[Updated On] ,[Latitude] ,[Longitude] ,[Location] ,CAST([Date] AS DATE) FROM [dbo].[Crimes_New] ORDER BY [Date] ; I hope this helps, and thanks for the comment!