Featured Post

Applying Email Validation to a JavaFX TextField Using Binding

This example uses the same controller as in a previous post but adds a use case to support email validation.  A Commons Validator object is ...

Tuesday, January 25, 2011

A School Revenue Data Mart Project

To try out Talend Open Studio in a Data Mart scenario, I put together a small project based on converting a spreadsheet of revenues for the Frederick Country (Maryland, US) school system into a dimensionally-modeled data mart.

I first formed a spreadsheet based on PDF reports posted on the school system's web site.  Each year, the PDFs reported actual revenues and their sources.  I left the expenses off for now.  I copied the values from each year into a column of the spreadsheet.  I didn't attempt to normalize any of the revenue sources since they are all from the same school system.  Here is a copy of the spreadsheet: FCPS Revenues.

The target data model is dimensional.  This means that each focal point of analysis -- year, school system, and revenue source -- will have a row in a dimension (a table).  A fact table contains the actual revenue value (a U.S. dollar amount).  The fact table will refer to the dimensions. The data model is presented below.

Dimensional Data Model for FCPS Revenues
I'll next create a Talend Open Studio Job for loading the spreadsheet into the data mart.  My focus will be on the rules and data transformations rather than the file processing (for now).

Here is a data dictionary describing the tables.
  • SR_DIM_SYSTEM.  A governmental school system.  A REVENUE must be reported by just one other SYSTEM.  A SYSTEM may report one or more REVENUES.
  • SR_DIM_YEAR.  A fiscal year.  A YEAR may record one or more REVENUES.  A REVENUE must be recorded by just one other YEAR.
  • SR_DIM_REVENUE_SOURCE.  A line item in the revenue summary.  A REVENUE_SOURCE may classify one or more REVENUES.  A REVENUE must be classified by just one other REVENUE_SOURCE.
  • SR_FACT_REVENUE. A value for the REVENUE.
The dimensions are Kimball Type 1 ("upsert") dimensions.

While useful for personal productivity, previous tutorials showed simple cases of loading an Excel file into a MySQL table or converting a text file to JSON.  Integration tools become far more beneficial with more substantive data work.  This spreadsheet and E-R diagram are the basis for a more significant example.

4 comments:

  1. Gartner has released its recent report on the data warehouse market, and once again Microsoft Azure has been ranked at the top of the leader board of Cloud data warehouse providers. In fact, the Azure team is now claiming that its most recent product release offers “the largest amount of scalable data warehousing capability for an on-demand cloud provider.” This new product release,Azure Synapse Analyticsis a fast, scalable warehouse solution that runs natively on the cloud.

    ReplyDelete
  2. Gartner has released its recent report on the data warehouse market, and once again Microsoft Azure has been ranked at the top of the leader board of Cloud data warehouse providers. In fact, the Azure team is now claiming that its most recent product release offers “the largest amount of scalable data warehousing capability for an on-demand cloud provider.” This new product release, Azure Synapse Analytics is a fast, scalable Data Warehouse Solution that runs natively on the cloud.

    ReplyDelete