Featured Post

Applying Email Validation to a JavaFX TextField Using Binding

This example uses the same controller as in a previous post but adds a use case to support email validation.  A Commons Validator object is ...

Thursday, January 27, 2011

Extending the School Revenue Data Mart Project

An earlier post showed a data mart of school revenues based on the Frederick County, Maryland (US) public school system.  This post extends the data mart to include a second school system, neighboring Washington County.

In the earlier post, the data model for the School Revenues Data Mart demonstrated basic load processing of data from the Frederick County, Maryland (US) school system.  This model version extends the data mart to accommodate more than one system.  The reports from neighboring Washington County and Carroll County were analyzed and this forced a broadening of the data structures.  This additional abstraction requires extra processing on the ETL side.

Attributes were added to the dimensions to support rolling up.  In the case of the school system, SR_DIM_SYSTEM now contains SYSTEM_COUNTRY and SYSTEM_US_STATE which may be repeated across several school systems.  SR_DIM_SYSTEM is a Type 2 dimension that changes as a school's enrollment changes.  SR_DIM_FUND -- formerly SR_DIM_REVENUE_SOURCE -- contains a FUND_SOURCE_NM (the provider) and a FUND_NM (the purpose of the funding).  SR_DIM_FUND is a Type 1 dimension (upsert).

There are two fact tables: FACT_REVENUE and FACT_ENROLL.  FACT_REVENUE is a list of individual revenue items and amounts.  FACT_ENROLL is a periodic snapshot to record the enrollment of a school system in a given year.

A reference table was added to normalize FUND_SOURCE_NM across the different data providers.  For example, FCPS lists "State of Maryland" while WCPS uses the more neutral "STATE REVENUE".  This allows the data to be input in a provider-specific manner, possibly keeping a pre-normalization audit record.

ETL staging tables were added so that more in-database processing can occur.  The first version of this data mart took multiple passes over the input spreadsheet.  While this worked for the simple case, in practice, a lot of flexibility was lost with the spreadsheet-based approach.  Operating on files can lead to a performance bottleneck.

  School Revenue Data Mart: May 21 2011  

There are reject tables mirroring the staging tables.  If there business rules are violated, these tables will contain the rejected records and an error description and rule ID.

The input spreadsheets were edited to help the processing.  Section headings were either concatenated with another column or broken out into a separate column.  Totals were removed completely.

The revised spreadsheets are here:
Feb. 7, 2011: I broke apart the enrollment spreadsheet into fiscal years to help with processing.  I'll also break apart the revenue spreadsheets as I add new data sources, other Maryland counties.

May. 21, 2011: I'm working on this on a laptop, so I moved the database to Microsoft Access.  I plan to move it back to MySQL or another db server when it's more stable.
 
The new data structures -- the ETL staging tables and the reference data table -- allow more work to be done in the database.  Look for a video tutorial to demonstrate the revised loading process.

1 comment:

  1. Note that the number of enrolled students could have been a snapshot-style fact, but it will likely be used in many reports and ratios as a focal point of analysis, especially in determining per-student spending.

    ReplyDelete