Data quality at King County GIS – Part 2

In Part 1 of this three-part series we discussed data quality within the framework of GIS maintenance prioritization and data review. Data quality at King County GIS (KCGIS) also includes processes and tools for validation of the contents of the Spatial Data Warehouse (SDW). Multiple linked automated and manual steps help ensure good internal consistency and an anomaly-free environment, both for file-system and database objects. In this post we focus on this validation of SDW objects, which consists of cradle-to-grave tracking of datasets and all their representations.

This tracking occurs in four phases:

Dataset notification and pre-posting
Registration, posting, and SDW check-in
SDW omission and commission error reporting
Dataset archiving and retirement

These validation steps do not evaluate internal data quality—that is handled by other protocols, such as those discussed in the first post. Rather, these validation steps help maintain consistency across elements of the SDW and ensure that relationships between all dependent objects and representations are intact. These steps compose multiple linked workflows which involve communication with data stewards, automated consistency analysis and reporting scripts, and checks implemented by the database administrator and the data coordinator.

Dataset notification and pre-posting

At King County our SDW is a federated data environment where multiple County agencies publish their business data. When proposing a new dataset for the SDW, an agency or KCGIS Center steward first completes a new dataset posting request form.

New dataset posting request form.

This information is then distributed via the GIS Data News Digest (now in its 171st issue) to a large GIS contact group which includes other stewards and a range of GIS data consumers. This step provides the opportunity to address issues related to best-practice conformance to naming convention, thematic library assignment, and stewardship point of contact, as well as alerting stewards and users that this new data will soon become available.

The proposed dataset is analyzed by a script which generates a standardized metadata template and evaluates the dataset schema for conformance to additional conventions such as allowable field names and lengths. This information is included with the Data Digest report and is provided to the dataset steward in a separate pre-posting report that details any issues that need mitigation.
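
A schema conformance check of this kind can be sketched in a few lines of Python. This is only an illustration, not the KCGIS script: the length limit, the allowed-character pattern, the reserved-word list, and the function name `check_schema` are all assumptions, since the actual conventions are not listed in this post.

```python
import re

# Hypothetical conventions for illustration; the real KCGIS rules are not given here.
MAX_FIELD_NAME_LENGTH = 10
FIELD_NAME_PATTERN = re.compile(r"^[A-Z][A-Z0-9_]*$")
RESERVED_NAMES = {"DATE", "ORDER", "SELECT"}


def check_schema(field_names):
    """Return a list of readable issues suitable for a pre-posting report."""
    issues = []
    seen = set()
    for name in field_names:
        if len(name) > MAX_FIELD_NAME_LENGTH:
            issues.append(f"{name}: longer than {MAX_FIELD_NAME_LENGTH} characters")
        if not FIELD_NAME_PATTERN.match(name):
            issues.append(f"{name}: must start with A-Z and use only A-Z, 0-9, or underscore")
        if name.upper() in RESERVED_NAMES:
            issues.append(f"{name}: reserved word")
        if name.upper() in seen:
            issues.append(f"{name}: duplicate field name")
        seen.add(name.upper())
    return issues


if __name__ == "__main__":
    # Made-up field names, used only to exercise each rule.
    for issue in check_schema(["PARCEL_ID", "owner name", "DATE", "ASSESSED_VALUE_2015"]):
        print(issue)
```

An empty list means the schema passed every rule, so the pre-posting report only needs a section for datasets where the list is non-empty.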

Registration, posting, and SDW check-in

Once the dataset is approved for publication, the data steward registers the agreed-upon dataset name using Steward Tool, our control table management interface, and queues the dataset for the nightly posting process. For new datasets, a final flag must be set manually by a member of the Enterprise Operations team to allow the transaction to occur, assuming there are no outstanding issues. If there is an issue, the steward receives a warning email indicating that it must be resolved before final buy-off is granted.

Every night a highly automated process called PostRep executes a rigorous set of checks, including conformance to the King County standard map projection and coordinate system, before copying the data from a staging database to the SDW servers. A comparable set of checks, but with more complex logic, is applied to the re-posting of updated datasets. These checks help ensure that change control is tightly managed. Regardless of how users connect to the data, the next morning they should find their applications and services performing normally while accessing or displaying the latest data revisions.

For new datasets a final check-in step is executed. After a successful PostRep execution, routines are run to generate the presentation layers (such as symbolized layer files) for the data and its metadata. The publication of all related components is verified through a 21-step checklist to ensure the dataset is available in all of its representations and locations where users expect to be able to access it. This report is provided to the steward for any follow-up actions, but also gives the Enterprise Operations team a confirmation that all automated processes completed.
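
The idea behind such a checklist can be sketched as a table of expected locations that is checked one by one. The three representations, the directory layout, and the function names below are hypothetical; the actual checklist has 21 steps that are not enumerated in this post.

```python
from pathlib import Path

# Hypothetical representations and path templates, for illustration only.
EXPECTED_REPRESENTATIONS = {
    "layer file": "layers/{name}.lyr",
    "metadata": "metadata/{name}.xml",
    "shapefile": "shapefiles/{name}.shp",
}


def run_checklist(root, dataset_name):
    """Return (label, exists) pairs, one per expected representation."""
    root = Path(root)
    results = []
    for label, template in EXPECTED_REPRESENTATIONS.items():
        path = root / template.format(name=dataset_name)
        results.append((label, path.exists()))
    return results


def format_report(dataset_name, results):
    """Render the results as a plain-text checklist for the steward."""
    lines = [f"Publication checklist for {dataset_name}"]
    for label, ok in results:
        lines.append(f"  [{'x' if ok else ' '}] {label}")
    return "\n".join(lines)


if __name__ == "__main__":
    # "sdw_root" is a placeholder directory name, not a real SDW path.
    print(format_report("parcel", run_checklist("sdw_root", "parcel")))
```

Keeping the expected locations in one table means that adding a new representation to the check is a one-line change rather than a new branch of logic.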

New data publication checklist.

SDW omission and commission error reporting

Even with the rigor of the automated posting tasks and high standardization in manual steps, anomalies can creep into the databases and file systems that compose the SDW. Newly developed Python-driven validation routines are highly efficient at finding omission errors, such as broken links (due to, for example, an unreferenced layer file) and feature-count inconsistencies (due to, for example, failed format conversion). These routines also find commission errors, where database or file-system objects exist in the SDW but should not. Finally, they bring to light omission errors in which a record in the master control table has no matching database or file-system representation of the dataset.

Spatial Data Warehouse synchronization.

These omission and commission validation routines report likely anomalies that are then reviewed by the data coordinator and the database administrator. Depending on the finding, an anomaly report leads either to the removal of an unsupported representation (for a commission error) or to an update of the missing or inconsistent representation (for an omission error). Frequent execution of the validation routines helps ensure continuing compliance and can also help ferret out any systemic problems that may be regenerating issues.
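
At their core, omission and commission checks are set comparisons between what the control table says should exist and what is actually found. The following is a minimal sketch of that comparison, not the KCGIS routines themselves; the function names, the case-insensitive matching, and the sample dataset names are assumptions made for illustration.

```python
def find_anomalies(registered_names, found_names):
    """Compare control-table names with names found in one SDW location."""
    registered = {name.upper() for name in registered_names}
    found = {name.upper() for name in found_names}
    return {
        # Registered in the control table, but no representation was found.
        "omission": sorted(registered - found),
        # Found in the SDW, but not registered in the control table.
        "commission": sorted(found - registered),
    }


def find_count_mismatches(feature_counts):
    """Flag datasets whose representations disagree on feature count.

    feature_counts maps a dataset name to {representation label: count}.
    """
    mismatches = {}
    for dataset, counts in feature_counts.items():
        if len(set(counts.values())) > 1:
            mismatches[dataset] = counts
    return mismatches


if __name__ == "__main__":
    # Made-up names and counts, used only to show the shape of the output.
    print(find_anomalies(["PARCEL", "STREET"], ["parcel", "OLD_ZONING"]))
    print(find_count_mismatches({
        "PARCEL": {"database": 1200, "shapefile": 1200},
        "STREET": {"database": 850, "shapefile": 849},
    }))
```

Running the comparison once per storage location (database, file system, layer files) keeps each report short enough for the data coordinator and database administrator to review by hand.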

Because of the key role that the control table entries play as the master record of the SDW contents, accuracy is enforced by double entry of dataset names—once in the table that drives the PostRep and validation processes, and a second time in the table maintained for actions reported in the Data Digest. Cross-match check routines help ensure ongoing consistency between the two tables.

Dataset archiving and retirement

When a dataset is to be retired from the SDW, specific steps ensure that no applications or services are broken, and ArcMap users are notified that existing map documents (MXDs) will need to be cleaned up if they reference a dataset that will be deleted.

A data steward first submits a request to have a dataset retired, along with the date they would like the transaction to be completed. The proposed deletion is advertised in the Data Digest to inform data consumers of the pending change. A script interrogates the SDW for all possible occurrences of the data and its representations. A detailed report is also generated that identifies which map or feature service dependencies may exist. This gives a heads-up to application developers who own applications or services that may be affected. Depending on the complexity of the dependencies, the data coordinator will then work directly with developers to address any mitigation that needs to occur.
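
A dependency report of this sort amounts to a reverse lookup from a dataset name to the services that reference it. The sketch below assumes the service-to-dataset mapping has already been extracted into a plain dictionary; how that mapping is obtained, along with the function, service, and dataset names, is hypothetical and not taken from the KCGIS scripts.

```python
def find_dependencies(dataset_name, service_sources):
    """Return the services that reference dataset_name, sorted by name.

    service_sources maps a service name to the dataset names its layers use.
    """
    target = dataset_name.upper()
    return sorted(
        service
        for service, sources in service_sources.items()
        if any(source.upper() == target for source in sources)
    )


def format_dependency_report(dataset_name, dependents):
    """Render the dependency list as plain text for developers."""
    if not dependents:
        return f"{dataset_name}: no service dependencies found"
    lines = [f"{dataset_name}: {len(dependents)} dependent service(s)"]
    lines.extend(f"  - {service}" for service in dependents)
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up services and datasets, for illustration only.
    services = {
        "BaseMap": ["PARCEL", "STREET"],
        "ZoningViewer": ["zoning_2010", "PARCEL"],
        "Trails": ["TRAIL"],
    }
    dependents = find_dependencies("zoning_2010", services)
    print(format_dependency_report("zoning_2010", dependents))
```

An empty result is as useful as a populated one, because it is the signal that the dataset can be removed without coordinating with service owners.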

Dataset archiving and retirement.

When the target removal date arrives, all file-system versions of the objects and representations are removed and the existence report is regenerated to verify removal. Then, any remaining map service dependencies are cleared. Once this step is finished, the object name is provided to the database administrator. Since all map-service dependencies on the database object are now resolved, a final script is executed against the database to archive the object and its control record. Then the database objects themselves are safely deleted. The Data Digest is updated with the final clean-up actions and the dataset transaction is closed.

Summary

The King County GIS Center continues to refine its Spatial Data Warehouse validation workflow. Tracking of the proposed transactions in a SharePoint task list is helping to manage the timely execution of proposed actions. The Data Handling team, mentioned in the first post of this series, works to identify stale data as possible candidates for removal. This supports incremental maintenance of the SDW contents and avoids building up a large backlog of cleanup work.

The next and final post in this Data Quality series will discuss how King County maintains dataset metadata, and how new tools have been implemented to evaluate and grade conformance to standards and to improve the quality of metadata content.

Mike Leathers is the GIS Data Coordinator in the King County GIS Center.