Archival Data Quality
Archiving preserves data for posterity. The goal of archiving a document is to take a snapshot of its data as they were at the time of archiving. The document may later be made accessible to any user with legitimate access rights: the original data producer, a researcher or anyone else interested. Ideally, the archived document should be as similar as possible to the original it was copied from, in both content and structure, but the archive may make technical changes to keep the document accessible as technology changes. This may result in data loss, which must be kept to a minimum and documented.
The following is a list of requirements that CLISC checks for, together with the actions it performs on the archived spreadsheet so that the spreadsheet meets the data quality level necessary for long-term storage. If archiving is not selected when using CLISC, these requirements are not applied.
Spreadsheets containing zero data have no value for later reuse and hence no value for archiving.
Risk
There are two risks: a spreadsheet with no sheets, or a spreadsheet with no cells containing data. Fortunately, Excel does not allow a user to save a spreadsheet without any sheets, and Excel Interop throws a COM exception in that case:
System.Runtime.InteropServices.COMException: 'The project folder must have at least one visible worksheet.'
However, there may be ways of deleting the sheets without using Excel, and Excel does allow saving a spreadsheet without any cell values.
Solution
Check the number of sheets (minimum one) and check for the presence of a cell value in any sheet of the spreadsheet (minimum one).
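The two checks can be sketched in Python against the Office Open XML package directly. This is a minimal illustration, not CLISC's own implementation. It assumes the workbook part sits at the conventional path `xl/workbook.xml`, that worksheets live under `xl/worksheets/`, and that the file uses the transitional SpreadsheetML namespace. The function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Transitional SpreadsheetML namespace (assumption: not a Strict OOXML file).
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def count_sheets_and_values(xlsx_bytes):
    """Return (number of declared sheets, number of cells holding a value)."""
    sheets = 0
    values = 0
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as package:
        for name in package.namelist():
            if name == "xl/workbook.xml":
                root = ET.fromstring(package.read(name))
                sheets = len(root.findall(f"{{{MAIN_NS}}}sheets/{{{MAIN_NS}}}sheet"))
            elif name.startswith("xl/worksheets/") and name.endswith(".xml"):
                root = ET.fromstring(package.read(name))
                for cell in root.iter(f"{{{MAIN_NS}}}c"):
                    # A cell carries data if it has a <v> value or an inline string <is>.
                    has_value = cell.find(f"{{{MAIN_NS}}}v") is not None
                    has_inline = cell.find(f"{{{MAIN_NS}}}is") is not None
                    if has_value or has_inline:
                        values += 1
    return sheets, values


def has_archivable_content(xlsx_bytes):
    """True when the spreadsheet has at least one sheet and one cell value."""
    sheets, values = count_sheets_and_values(xlsx_bytes)
    return sheets >= 1 and values >= 1
```

Reading the package as bytes keeps the check independent of Excel, which matters for files that were produced or edited by other tools.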
Spreadsheets may be password protected at file, workbook, sheet or cell level. The protection may cover both reading and writing, or writing only. A spreadsheet may also be write protected without a password at file property level.
Risk
The spreadsheet cannot be opened if the password is lost over time. A spreadsheet that cannot be opened cannot be validated, and the archival data requirements cannot be applied to it.
Solution
Alert the user if the spreadsheet cannot be read because of password protection. The spreadsheet should then be unprotected manually.
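One way to detect protection before attempting any other processing is sketched below in Python. It is an illustration under stated assumptions rather than CLISC's method. A password-to-open Office Open XML file is stored inside an OLE compound file instead of a ZIP package, so the sketch checks for the compound file signature first (a legacy .xls file has the same signature, so the result is only a hint). Workbook and sheet protection are read from the `workbookProtection` and `sheetProtection` elements. The function name and the part paths are assumptions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
# Signature of an OLE compound file (used for encrypted OOXML and legacy .xls).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def protection_report(file_bytes):
    """Return a list of human-readable protection findings (empty if none)."""
    if file_bytes.startswith(OLE_MAGIC):
        return ["file is an OLE compound file: possibly encrypted, or legacy .xls"]
    findings = []
    with zipfile.ZipFile(io.BytesIO(file_bytes)) as package:
        for name in package.namelist():
            is_workbook = name == "xl/workbook.xml"
            is_sheet = name.startswith("xl/worksheets/") and name.endswith(".xml")
            if not (is_workbook or is_sheet):
                continue
            root = ET.fromstring(package.read(name))
            # Both protection elements are direct children of their root element.
            if root.find(f"{{{MAIN_NS}}}workbookProtection") is not None:
                findings.append(f"{name}: workbook protection is enabled")
            if root.find(f"{{{MAIN_NS}}}sheetProtection") is not None:
                findings.append(f"{name}: sheet protection is enabled")
    return findings
```

An empty list means none of the checked forms of protection was found. It does not rule out cell-level locking or a read-only file property.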
Spreadsheets may have data connections to external sources such as other spreadsheets, CSV or XML files or servers.
Risk
Data connections allow the affected cell values to be updated in the future, which compromises the snapshot of data taken at the time of archiving. Documenting the existence of a data connection as metadata is not considered significant for authentic preservation.
Solution
Remove any data connections from the spreadsheet.
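Before removing anything, it helps to know which connections exist. The Python sketch below lists the names of the connections declared in a spreadsheet. It assumes the connections part is stored at `xl/connections.xml`, which is where Excel conventionally writes it, and the function name is hypothetical. It only detects connections. Actual removal would also have to delete the matching relationship, the content type entry and any query tables, which is not shown here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
# Conventional location of the connections part (assumption).
CONNECTIONS_PART = "xl/connections.xml"


def list_data_connections(xlsx_bytes):
    """Return the names of all declared data connections (empty if none)."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as package:
        if CONNECTIONS_PART not in package.namelist():
            return []
        root = ET.fromstring(package.read(CONNECTIONS_PART))
    return [conn.get("name", "") for conn in root.iter(f"{{{MAIN_NS}}}connection")]
```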
Spreadsheets may contain embedded objects such as other spreadsheets, Word documents, PowerPoint presentations, 3D objects or images, and these objects may be in formats that are not sustainable in the long term.
Risk
Embedded objects may be in formats that an archive does not accept for long-term preservation.
Solution
Alert the user to the existence of embedded objects, so that the user can manually extract the data from the spreadsheet and store the data separately.
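A simple way to build such an alert is to list the parts in the folders where Excel conventionally stores embedded material. The Python sketch below assumes those folders are `xl/embeddings/` (embedded documents and OLE objects) and `xl/media/` (images and similar media). Other producers may use different paths, so treat the prefixes and the function name as assumptions.

```python
import io
import zipfile

# Conventional folders for embedded objects and media (assumption).
EMBED_PREFIXES = ("xl/embeddings/", "xl/media/")


def list_embedded_parts(xlsx_bytes):
    """Return sorted (part name, size in bytes) pairs for embedded parts."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as package:
        return sorted(
            (info.filename, info.file_size)
            for info in package.infolist()
            if info.filename.startswith(EMBED_PREFIXES) and not info.is_dir()
        )
```

The part names carry the file extension, which gives the user a first indication of the format that has to be extracted and archived separately.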
External relationships include linked OLE objects (which are not embedded) and linked cell values fetched from another spreadsheet. Both kinds of relationship point to files in the local directory, and the relationship is broken if either the files or the spreadsheet are removed from that directory.
Risk
Archiving typically means removing the spreadsheet from its local environment, particularly if the archive is pursuing a data migration strategy, and data that depends on external relationships will then be lost.
Solution
Remove any external relationships. For linked cell values, keep the actual cell values as the snapshot. Unembedded OLE objects should be handled manually: ideally the object is archived separately and the relationship is documented in metadata. An alternative solution is to embed the OLE object.
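In an Office Open XML package, relationships are stored in `.rels` parts, and a relationship that points outside the package is marked with `TargetMode="External"`. The Python sketch below lists those relationships as a starting point for removal or documentation. The function name is hypothetical. Note that ordinary hyperlinks are also stored as external relationships, so the caller would need to filter on the relationship type.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of Open Packaging Conventions relationship parts.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def list_external_relationships(xlsx_bytes):
    """Return (rels part, relationship type suffix, target) for external targets."""
    found = []
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as package:
        for name in package.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(package.read(name))
            for rel in root.iter(f"{{{REL_NS}}}Relationship"):
                if rel.get("TargetMode") == "External":
                    # Keep only the last URI segment, e.g. "externalLinkPath".
                    kind = rel.get("Type", "").rsplit("/", 1)[-1]
                    found.append((name, kind, rel.get("Target")))
    return found
```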
Microsoft Office Excel provides a worksheet function, RealTimeData (RTD). This function enables you to call a Component Object Model (COM) Automation server to retrieve data in real time.
The RTD function uses the following syntax =RTD(RealTimeServerProgID,ServerName,Topic1,[Topic2], ...)
Risk
When an archived spreadsheet is opened many years from now, the server connection might still be available and the values would then update automatically. This interferes with the archival goal of taking a snapshot of data at a given time.
Solution
Remove any RTD functions from cells in every sheet of the spreadsheet, but keep the actual cell values as the snapshot.
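In worksheet XML, a formula cell stores the formula in an `<f>` element and the last calculated result in a `<v>` element, so removing the formula while keeping the snapshot amounts to deleting `<f>` and leaving `<v>` in place. The Python sketch below does this on an already parsed worksheet. It is an illustration, not CLISC's implementation, and the function name is hypothetical. The text match is a simple pattern that could also trigger on `RTD(` inside a string literal.

```python
import re
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
# Matches a call to RTD, case-insensitively, but not names such as MYRTD(.
RTD_CALL = re.compile(r"\bRTD\s*\(", re.IGNORECASE)


def strip_rtd_formulas(sheet_root):
    """Delete <f> from cells whose formula calls RTD; return their references.

    The cached <v> value is left untouched, so it remains as the snapshot.
    """
    stripped = []
    # Materialize the cell list first so the tree is not mutated mid-iteration.
    for cell in list(sheet_root.iter(f"{{{MAIN_NS}}}c")):
        formula = cell.find(f"{{{MAIN_NS}}}f")
        if formula is not None and formula.text and RTD_CALL.search(formula.text):
            cell.remove(formula)
            stripped.append(cell.get("r"))
    return stripped
```

Writing the modified tree back into the package needs care: namespace prefixes must be preserved on serialization, and a calculation chain part, if present, may still list the cells that no longer hold formulas. Neither is handled in this sketch.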
Office Open XML spreadsheets may contain encrypted printer settings. Archived data should not contain any encrypted information. Embedded printer settings are not necessary for printing the spreadsheet.
Risk
The encryption of the printer settings may be broken, giving criminals access to confidential printer addresses, which may compromise the network.
Solution
Remove printer settings from the spreadsheet.
A spreadsheet may be created using frameworks or by manually editing the XML. This may introduce errors, where the XML does not conform to the official schema files. Conformance to the schemas guarantees that any valid spreadsheet can be opened by a standard spreadsheet reader. Therefore, each spreadsheet must be validated against the file format schemas, and any errors found must be corrected.