For new DaaS implementations, Excel may be the best option for gathering data. That's because Excel workbooks can be expanded to fit new discoveries about the data itself made during the gathering. The loose structure has a downside; data entry errors can enter into the DaaS. Because the new data may not have been seen before, these might be tough to identify.
The post offers a few tips for gathering data. These tips are based on a project where public school budget information is being entered into a DaaS that produces reports and web services on an aggregated collection of budgets. The project gathers data across Maryland's counties, performing some light normalization while preserving the original source data.
The Maryland school budgets are generally published in PDFs. Some information, may need to be gathered from web sources (HTML) or PowerPoint slides. This document, a link to Frederick County Public Schools' 2011 adopted operating budget. The data being loaded is listed in tables under the Revenue Summary and Expenditure Summary sections.
Tip 1: Record a Normalized Value
The target workbook will reproduce the tables in the PDFs. The nested categories in both the Revenue Summary and Expenditure Summary will be flattened, with a top-level broken out. Determining the top-level is based on an analysis of the full data set, scanning through all of the sources. For example, all Maryland schools get funding from the State of Maryland.
The other categories (Food Services Operating Expenses, Unrestricted Fund Sports Fees) will need to be preserved.
Tip 2: Add Data Entry Checks
The most common error I found was to mistype an amount. This can be mitigated by keeping track of summaries and comparing the summaries against the source data. Any discrepancies will require a study of the amounts.
I added a row with a special heading "*** DATA ENTRY CHECK". This is a cue to the data loading program, in this case Talend Open Studio, to ignore these records.
|Source Data with Special Data Entry Check|
Tip 3: Record Source Information
With each record, I stored the source information. This includes both an original URL (which may become invalid if the document is moved or renamed) and a title and author. This lets me tie a particular data element with its source document. You could consolidate the source information, say embedding it in the filename or in a separate sheet. But files can be renamed, and I'm not sure that a separate sheet will give me the grain I need if I have to mix in data from more than one source document.
The file name is very descriptive and formatted to support automated processing. This is subject to change as I work with more files from different years and school systems.
A source workbook can be found here. This contains a manually-created Revenue Summary and Expenditure Summary based on Frederick County Public Schools' 2011 budget. I tried various cut-and-paste techniques, but the format of the tables wasn't parseable into columns by Excel (the only delimiter available was spaces that broke apart the categories). There are some data integration PDF components that may help if all the source data is in the same format, but this isn't the case for the school budgets.