How to avoid common mistakes when digitizing data sets
Since 2001, we’ve worked with dozens of academic researchers in several fields to enter data from print and PDF sources, digitize surveys and collect data from unstructured digital files. In the process, we’ve observed our fair share of avoidable mistakes and their negative impact on research. Below, we list some common mistakes and suggest how to avoid them in your project. We review all of these common mistakes before embarking on projects with our Academic Research clients. For more information about DDD and our services, please see our web page on Academic Research data entry: http://www.digitaldividedata.org/services/structured-data/
1: A poorly chosen conversion process increases costs, time frame or prevalence of errors
Optical Character Recognition (OCR) technology can be faster and cheaper than relatively expensive keying (re-typing). However, OCR is unsuitable for some projects, with results so inaccurate that they cannot be used for the intended purpose. Without testing your conversion process, it is not always possible to predict the quality of the output. OCR can usually convert individual characters (numbers, letters, etc.) correctly, but it often merges fields and misplaces cells. We also find that academics frequently write scripts to load OCR output into structured databases. These scripts usually rely on punctuation or line breaks, which are exactly the two things that OCR is weakest at converting.
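The field-merging problem described above can often be caught with a very simple test on a sample of OCR output. The following is a minimal sketch, assuming the output is comma-delimited text with a known number of fields per record; the function name `find_suspect_rows` and the sample records are hypothetical and only for illustration.

```python
import csv
import io

def find_suspect_rows(text, expected_fields, delimiter=","):
    """Return (row_number, field_count) for rows with the wrong number of fields."""
    suspects = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        if len(row) != expected_fields:
            suspects.append((row_number, len(row)))
    return suspects

# Hypothetical OCR output: row 2 lost a comma, row 3 lost a line break
# and so merged two records into one.
sample = "Smith,1901,42\nJones 1902,37\nBrown,1903,51 Green,1904,48\n"
print(find_suspect_rows(sample, expected_fields=3))
```

A check like this only shows that a row is structurally wrong, not which character was misread, so flagged rows still need to be compared against the source image.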
Solutions
2: The scale and complexity of the project is inaccurately estimated because of the sheer quantity of the source materials, resulting in poor quality output, delays and increased costs
Consider the following scenario: after a brief inspection, a project of three hundred thousand pages is deemed suitable for OCR. Budget is secured and timescales are set. However, in the depths of the project it is discovered that 10% of the pages are set in a typeface that the OCR software cannot read. These pages must be identified and then reprocessed using double-keying. The project ends up 25% over budget and several weeks late.
Solution
Add a scoping stage to your project during which you take representative samples and then thoroughly audit your primary materials, understanding the different conditions present throughout. Appropriate decisions can then be taken on the correct conversion process and how much time and budget will be required.
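As a rough illustration of the scoping stage, the sketch below draws a simple random sample of pages to audit and then turns the audit result into an estimate for the whole collection. The sample size of 400, the audit finding of 40 unreadable pages and the function names are assumptions made for the example, and the interval uses a basic normal approximation rather than anything more refined.

```python
import math
import random

def draw_sample(total_pages, sample_size, seed=0):
    """Pick page numbers uniformly at random, without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total_pages + 1), sample_size))

def estimate_problem_share(problem_count, sample_size, z=1.96):
    """Point estimate and approximate 95% interval (normal approximation)."""
    p = problem_count / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

pages = draw_sample(total_pages=300_000, sample_size=400)

# Assumed audit result: 40 of the 400 sampled pages cannot be read by OCR.
share, low, high = estimate_problem_share(problem_count=40, sample_size=400)
print(f"{share:.1%} of pages (roughly {low:.1%} to {high:.1%})")
print(f"about {round(share * 300_000):,} pages may need keying")
```

If the source materials come in distinct batches (different volumes, printers or decades), sampling from each batch separately is likely to give a more reliable picture than one sample across the whole collection.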
3: Poorly scanned, photographed or photocopied images make data entry expensive and inaccurate
Solution
Professional scanning can improve accuracy and even enable OCR to work properly, cutting costs considerably. Many academic libraries provide scanning services. Professional scanning vendors can also scan from books (even fragile or rare ones), loose paper and microfilm. DDD works with professional scanning firms around the world and can recommend a partner.
4: Poor decisions on resourcing projects cause delays, low data quality and high overall cost
Some researchers report that using student workers can result in high costs, delays and low-quality data, even though their hourly cost to your budget may be small. The reasons they give are that course work serves as a distraction, data entry is repetitive and undemanding, and students feel little obligation to produce perfect results. Furthermore, the scale and flexibility of the student “market” is limited, and the weight of managing the project and coordinating the students rests on the shoulders of the academic commissioning the work.
Solution
Consider carefully whether students can fulfill your needs and investigate alternative resources, should you require them.
5: Within the source material the same data are entered in different ways, making collation for analysis difficult, especially where more than one person is entering data. For example;
Solution
Before embarking on conversion, review representative samples from the source materials to identify which differences may occur. Establish written rules for the correct way to enter these differences and train those inputting the data in how to follow them. Make them aware of the procedure to follow when a new difference (without a rule) is discovered. Ideally a review of the final data should be made to correct any instances that “slip through”.
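Written rules of this kind can also be applied and checked in software. The sketch below is one possible approach, not a prescribed method: the `RULES` table, the country-name variants and the function names are all hypothetical examples. Known variants are mapped to one agreed form, and any value without a rule is collected for manual review instead of being silently accepted.

```python
# Hypothetical written rules: each known variant maps to one agreed form.
RULES = {
    "n/a": "",
    "na": "",
    "-": "",
    "u.s.": "United States",
    "usa": "United States",
    "united states": "United States",
}

def apply_rules(value, rules=RULES):
    """Return (normalized_value, matched); matched is False when no rule applied."""
    key = value.strip().lower()
    if key in rules:
        return rules[key], True
    return value.strip(), False

def review_column(values, known_ok):
    """Normalize values and collect those with no rule for manual review."""
    cleaned, unresolved = [], []
    for value in values:
        normalized, matched = apply_rules(value)
        cleaned.append(normalized)
        if not matched and normalized not in known_ok:
            unresolved.append(value)
    return cleaned, unresolved

cleaned, unresolved = review_column(
    ["USA", " U.S. ", "N/A", "France", "U.S.A"],
    known_ok={"United States", "France"},
)
print(cleaned)     # normalized column
print(unresolved)  # values that need a new rule
```

Each unresolved value is a prompt to add a rule to the written list, which keeps the rules and the data in step as new differences are discovered.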
Related to this issue and #6, below, is character “encoding.” Different software applications store text characters differently. Spreadsheet software often applies data masks for dates and fractions that can change the underlying text data. Be sure to check the data output in the final application you will be using for data analysis.
6: An inappropriate choice for the output format makes data hard to interrogate and/or expensive to reconvert
Excel or Word can seem the obvious choice, but other formats (XML, Access, SPSS, etc.) might be more effective for current and/or long-term use.
Solution
Consider the current and potential future uses of your data. Is your chosen output format flexible enough to be easily converted into other formats you may need? In most cases it is advisable to avoid proprietary formats (e.g. Word 2007).
7: Inconsistent file-naming means data is lost or hard to find
Solution
Output file names should carry descriptive information, for example, “the_times_22_07_1976_p35.xls” so that they can be quickly and easily related back to the source materials. Any digital input files (such as JPEG or TIFF scans) should ideally have the same names.
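A naming convention is easier to keep consistent when the names are generated, not typed by hand. Here is a minimal sketch that reproduces the pattern of the example above; the helper names `build_file_name` and `names_match` are hypothetical, and the day-month-year order simply follows the example.

```python
import re
from datetime import date

def build_file_name(source, issue_date, page, extension):
    """Build a name such as the_times_22_07_1976_p35.xls from its parts."""
    slug = re.sub(r"[^a-z0-9]+", "_", source.lower()).strip("_")
    return f"{slug}_{issue_date:%d_%m_%Y}_p{page}.{extension}"

def names_match(scan_name, output_name):
    """True when a scan and its output file share the same base name."""
    return scan_name.rsplit(".", 1)[0] == output_name.rsplit(".", 1)[0]

output_name = build_file_name("The Times", date(1976, 7, 22), 35, "xls")
scan_name = build_file_name("The Times", date(1976, 7, 22), 35, "tif")
print(output_name)
print(names_match(scan_name, output_name))
```

If files need to sort chronologically in a folder listing, a year-month-day order may be worth considering instead.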
8: Absence of rules for handling exceptions results in data inconsistency
Exceptions can affect the final data analysis, and should be recorded and coded for reference. Examples include a table which spans multiple pages, a chart that has different titles within a data collection, or special characters from a foreign language (e.g. a Greek “delta” character in a financial table).
Solution
Establish your processing rules (“tables, text, titles, authors, captions, pictures, headings, etc. are handled in the following way”). Consider also your process for exceptions, however simple: “Any pages or page items which fall outside the processing rules should be stored in folder X and the corresponding entry in the output spreadsheet should be marked with an X.”
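The “mark with an X” rule can be sketched in a few lines. In the example below, the rule itself (any non-ASCII character, such as a Greek delta, counts as an exception), the column name `exception` and the sample rows are assumptions for illustration; a real project would substitute its own written rules.

```python
def has_non_ascii(row):
    """Hypothetical rule: any non-ASCII character makes the row an exception."""
    return any(not str(value).isascii() for value in row.values())

def flag_exceptions(rows, is_exception):
    """Return copies of the rows with an 'exception' column set to 'X' or ''."""
    flagged = []
    for row in rows:
        marked = dict(row)  # copy, so the original data is left untouched
        marked["exception"] = "X" if is_exception(row) else ""
        flagged.append(marked)
    return flagged

rows = [
    {"page": "12", "value": "4.5"},
    {"page": "13", "value": "Δ 0.3"},
]
flagged = flag_exceptions(rows, has_non_ascii)
for row in flagged:
    print(row)
```

Keeping the flag in its own column means exceptions can be filtered, counted and reviewed later without altering the entered values.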