Prior to our sixth DITA session, Working with Data I, I found reading Broman and Woo’s 2018 paper Data organisation in spreadsheets very helpful and informative. Focusing on data entry and storage aspects, it offers practical advice for organising spreadsheet data to reduce errors and ease later analyses. Some of the basic principles included in this paper are: be consistent, do not leave empty cells, put just one thing in a cell, make back-ups, and so on. As somebody who hasn’t created an Excel spreadsheet in about 10 years, it provided me with some useful insight and crucial reminders about how to keep one’s data clean and accessible when using this software. Personally, I could instantly understand the benefits of sinking my teeth into this paper. However, I couldn’t help but think that some people on the course, who may use spreadsheets every day in their jobs, may have found this particular paper a little obvious in its content and pushed it to one side.
Indeed, the content of this paper should certainly be obvious to people who work with data professionally. However, recent stories in the press suggest otherwise. On October 8th this year, it was reported by the BBC that Excel ‘caused Covid-19 results to be lost’. Judging by the headline alone, the blame is being pushed away from the human professionals and towards the software. An interesting stance… Indeed, it is argued on the European Spreadsheet Risks Interest Group’s website that the headline of this article should be corrected to ‘Why a lack of basic data controls caused Covid-19 results to be lost’. There’s no doubt this would be a much more honest (yet damning) headline.
Despite this suspicious choice of title, the article goes on to report the data illiteracy of Public Health England (PHE) when putting together important Covid-19 figures. It states that a badly thought-out use of Excel was the reason nearly 16,000 cases went unreported. PHE were tasked with the crucial job of setting up an automatic process to pull data together into Excel templates so that it could be uploaded to a central system. Problematically, PHE’s developers picked an old file format, known as XLS, to do this. As a consequence, each template could handle only about 65,000 rows of data rather than the 1 million-plus rows that Excel is actually capable of. Once a template reached that limit, further cases were simply left off. To remedy this problem, PHE is now breaking the data down into smaller batches to create a larger number of Excel files.
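To make the row limit concrete, here is a minimal Python sketch of the batching idea described above. It is an illustration only, not PHE’s actual process: the function name `split_into_batches`, the single header row and the made-up records are all my own assumptions. The two constants are the exact per-worksheet row limits behind the rounded figures in the article.

```python
# Per-worksheet row limits in Excel file formats.
XLS_MAX_ROWS = 65_536       # 2**16 rows in the legacy .xls format
XLSX_MAX_ROWS = 1_048_576   # 2**20 rows in the newer .xlsx format


def split_into_batches(records, max_rows=XLS_MAX_ROWS, header_rows=1):
    """Split records into chunks that each fit on one worksheet.

    Hypothetical helper: it assumes every record takes one row and that
    each worksheet also needs `header_rows` rows for column names.
    """
    capacity = max_rows - header_rows
    if capacity <= 0:
        raise ValueError("max_rows must be larger than header_rows")
    return [records[i:i + capacity] for i in range(0, len(records), capacity)]


# Made-up example data: 100,000 placeholder result rows.
records = [f"result-{n}" for n in range(100_000)]
batches = split_into_batches(records)

# Every record ends up in exactly one batch, and no batch exceeds the limit.
total_rows = sum(len(batch) for batch in batches)
```

Batching only works around the limit. It does not, on its own, tell anyone when rows have gone missing.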
It is obvious, as writers on the European Spreadsheet Risks Interest Group’s website agree, that PHE should have chosen a file format without a size limit to process their results. However, as is clear in Broman and Woo’s paper, whatever technology is used, whenever data is exchanged between systems, data checks must be carried out that reconcile the output of a conversion stage with its input, for example by comparing record counts.
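A record count check of this kind can be very small. The Python sketch below is my own hedged illustration of the idea, not code from the paper or from PHE: `reconcile_counts`, `RecordCountMismatch` and `truncating_export` are hypothetical names, and `truncating_export` simply imitates an export step that silently drops rows beyond a limit.

```python
class RecordCountMismatch(Exception):
    """Raised when a conversion stage outputs a different number of records."""


def reconcile_counts(source_records, converted_records, stage="conversion"):
    """Compare record counts before and after a conversion stage."""
    n_in = len(source_records)
    n_out = len(converted_records)
    if n_in != n_out:
        raise RecordCountMismatch(
            f"{stage}: {n_in} records in, {n_out} records out "
            f"(difference of {abs(n_in - n_out)})"
        )
    return n_in


def truncating_export(records, max_rows=65_536):
    """Hypothetical stand-in for an export that silently cuts off extra rows."""
    return records[:max_rows]


# Made-up example: 80,000 input records pushed through the truncating export.
source = list(range(80_000))
exported = truncating_export(source)

try:
    reconcile_counts(source, exported, stage="export to template")
    mismatch_message = None
except RecordCountMismatch as error:
    mismatch_message = str(error)  # the check flags the missing rows
```

The point is not the code itself but where it sits: the check runs straight after the conversion, so a shortfall is caught before the figures are passed on.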
Shocked (and interested) by this story and the ramifications of data illiteracy, I wondered whether this kind of thing was an isolated incident (quickly realising there was no way it could be, given the nature of the aforementioned report). Embedded in Broman and Woo’s paper was a link to a webpage (http://www.eusprig.org/horror-stories.htm), which is a public archive of spreadsheet horror stories. The PHE scandal was top of the list. Visiting this webpage led me to a worrying statement: ‘Spreadsheet errors are common and non-trivial’ (Panko, 2000). As we know very well from our DITA module, data is of the utmost importance in our lives, and the misuse of it can be (as was emphasised on this webpage) damaging at best, fatal at worst. As LIS professionals, it is crucial that we take responsibility for protecting the integrity of data in order to ease later analysis.
Image courtesy of the BBC 2020
BBC (2020) Excel: Why using Microsoft’s tool caused Covid-19 results to be lost. Available at: https://www.bbc.co.uk/news/technology-54423988 (Accessed: 25 November 2020).
Broman, K. and Woo, K. (2018) ‘Data Organization in Spreadsheets’, The American Statistician, 72(1), pp. 2-10. DOI: 10.1080/00031305.2017.1375989
European Spreadsheet Risks Interest Group (2020) EuSpRIG Horror Stories. Available at: http://www.eusprig.org/horror-stories.htm (Accessed: 25 November 2020).
Panko, R. (2008) ‘Spreadsheet Errors: What We Know. What We Think We Can Do’, Proc. European Spreadsheet Risks Int. Grp. (EuSpRIG) 2000, 7 (17). arXiv:0802.3457