EfficientDataEntry 110314
EfficientDataEntry 110314
Suggested citation:
Arif, A.S., A. Benini, et al. (2010). "Efficient Survey Data Entry. A template for
development NGOs". Sylhet and Washington DC: FIVDB [Version March 2011].
Table of Contents
Preface ..........................................................................................................................1
Acronyms and abbreviations ..........................................................................................3
Acknowledgements .......................................................................................................3
Summary .......................................................................................................................4
Introduction ...................................................................................................................7
Data entry under time pressure...................................................................................7
[Sidebar:] Types of entry errors .................................................................................9
Survey quality in development NGOs ......................................................................10
Data entry and empowerment ..................................................................................12
[Sidebar:] Rapid utilization in the field ....................................................................12
Ergonomic and data quality motivations ..................................................................14
Organization of this study ............................................................................................16
Mechanics of data entry ...............................................................................................16
Assumptions and precautions ...................................................................................16
[Sidebar:] Spreadsheets for data management - pros and cons ..................................18
Demo file.................................................................................................................19
Key elements ...........................................................................................................20
Technicalities: Excel building blocks ...........................................................................24
Outside VBA ...........................................................................................................24
[Sidebar:] A way to construct unique identifiers.......................................................26
[Sidebar:] Treatment of multiple-choice response data .............................................28
In the VBA code ......................................................................................................30
Experience working with the template .........................................................................38
Speed.......................................................................................................................38
Reliability ................................................................................................................39
Assembly of master tables .......................................................................................39
Shortcomings ...........................................................................................................40
[Sidebar:] Re-coding variables - an important skill ..................................................40
[Sidebar:] Contracting data entry out to grassroots groups .......................................42
Outlook .......................................................................................................................44
References ...................................................................................................................46
Sample macro code......................................................................................................48
Revisions and known bugs...........................................................................................67
Author information ......................................................................................................68
List of Figures
Surveys contribute, in important if not always prominent ways, to the information flow.
They supply information that guides a variety of deliberations and decisions, from the
selection of program sites, through the recruitment of participants, the enhancements of
program reports beyond financial and administrative data, and ultimately to the
measurement of outcomes and impact.
First, although rarely discussed in these terms, there is a market for data in FIVDB and
similarly, I believe, in other development NGOs. Data is continuously being produced
and exchanged. Survey data is produced, not by one department only, but in
collaborative efforts. Our monitoring associates may be the ones committing the bulk of
the data to computers, but in the first place the interviewing in the villages is done by
frontline workers of other departments. These units give time and effort because they
hope that the surveys will benefit their work too, in the short run by accessing potential
clients, in the longer term by demonstrating its impact. The market for data is not only
internal. Volunteers in Community Learning Centers create village maps, with
numbered households that are the basis for sampling later during outcome measurement.
They do this because they expect FIVDB to return information of practical value.
1
Second, FIVDB has been using computers for the management of survey data since
1987. Our command of this technology has grown but slowly and with occasional
setbacks. We are still struggling with hardware problems in field offices, but even more
so with the turnover of trained personnel and with the conceptual challenges of building
a consistent survey tradition. At the same time, the qualitative demands for data have
grown, in a competitive aid climate that wants programs to demonstrate impact. While
our donors are free to commission external evaluations, FIVDB itself aspires to monitor
its work using data that I like to call, for lack of a better term, "near research-grade".
Valid definitions and reliable processing become imperative if we are to be credible.
It is in those two contexts - collaboration and quality - that data entry technology
matters. The better it helps to reduce errors, the freer will our statistics be of noise and
bias. The faster it does the job, the sooner will FIVDB, its donors and its partners in
villages and government have access to survey data, survey findings and their
implications. The easier it is on the monitoring associates, the more time and energy
will they save to do other important work, away from their desks and in the true field.
The new data entry template that one of our large baseline surveys is experimenting
with is a step in that direction. This paper details its technicalities and makes its core
elements available for others. Like other technologies, it will be effective if it is "always
at our side, and never in our way". Data entry is an important subject for survey
practitioners - so that the rest of us need not bother about it.
Bazle M. Razee
Associate Director, Program & Planning
2
Acronyms and abbreviations
Acknowledgements
This study would not have been possible without the work of these colleagues in the
FIVDB Policy, Planning and Research Unit:
Arif Azad Khan, Arif Mohammad Shakil, Hasena Begum, Md. Abdus Salam, Md.
Yeasin Mazumder, Muhammad Al Amin, Muhammad Saddam Hossain, Rakshit
Bhattacharjee, Rashida Jahan Qureshi, Samaresh Talukder, Syeda Shahina Akther Ruba.
Dipok Roy, of the FIVDB head office staff, took several of the photos used in this
document.
3
Summary
Data entry in particular is the least loved of all survey phases, feared as a source of cost
and error, a necessity without learning value. Survey leaders, worried about cost, time
and reliability, are tempted to invest the bare minimum in data entry operators. Greater
productivity is expected chiefly from technical improvements, often designed as closed
applications that the NGO subsequently cannot adapt to changing needs.
This approach is short-sighted. Appropriate data entry tools, together with skills training
and encouragement to rapidly analyze data in field offices, can do more. They are
empowering, for direct survey workers as well as for contributing program staff and
grassroots organizations.
This paper presents the rationale and key elements of the template. It comes as a
workbook in Microsoft Excel, an application widely used among NGO workers. We
describe most elements in a non-technical language. For technically inclined readers, we
4
detail how a special Excel feature - event macros - is harnessed to check entries for
valid values and to scroll to the next entry cell.
We document the code and supply a demonstration file. Specifically, the template may
appeal to NGOs which
The experience that FIVDB has so far made with the tool suggests that data entry is
considerably faster and less prone to produce spelling or data type errors. Also, due to
5
the protected variable sequence, the process of assembling master tables from numerous
contributing tables (for different villages, and from several field offices) is safer.
As a side benefit, this template has enabled FIVDB to outsource a large part of survey
data entry to a computer center run by a village grassroots association. Efficient prompts
for categorical variables and automated error flagging ensure good quality with minimal
outside supervision, setting monitoring staff free to attend to other work.
Adapting the template to the needs of other surveys involves the definition of data tables,
variable codes and categories, and macro segments as exemplified in the text as well as
in the demonstration Excel workbook. The workbook can be downloaded via this link.
This tool can be further improved. Also, we are convinced that "out there" in the wider
NGO world many other solutions have been applied to data entry challenges. Over the
next years, survey data may increasingly be captured in programmable mobile phones.
Many of the required conceptual skills, however, will remain the same.
Improvements in survey data entry are only a small part of an evolving toolbox. FIVDB
recognizes that better data management and analysis skills are needed in monitoring
staff as well as in their close program staff counterparts. Increasingly, NGOs are under
pressure to demonstrate impact. The capacity for self-observation is critical in
competitive funding markets. More efficient surveys add to it.
6
Introduction
FIVDB, like many other development NGOs, has been conducting numerous surveys
over the years. The nature of surveys has varied. Some were undertaken to satisfy
project reporting; others looked at impact out of a concern that FIVDB and its donors
shared; yet others were done for our own learning. A few followed templates prescribed
by specific donors, sometimes without any role left for FIVDB to take part in analysis
and discussion. In most surveys, however, FIVDB designed and executed all the phases,
from concepts to dissemination of findings.
That has included data entry, which we have, with rare exceptions, done in-house. In
2008, FIVDB added a new project, Jonoshilon (FIVDB 2008), with educational, social
organization and livelihoods interventions planned in 850 village communities. A dozen
monitoring staff were hired to better observe the expanding project. They have been
engaged in large baseline surveys. Despite the staff reinforcements, time and resource
constraints made themselves felt particularly in the data collection and entry phases.
In order to ensure data reliability while at the same time easing the strain on field
monitors, we have rationalized data entry over successive survey rounds. At present, an
innovative data entry template is being used. This study details its architecture and
documents macro code that other survey designers may wish to adapt to their own needs.
We trust that there are other solutions and insights "out there" in the large world of
NGO-driven surveys and are eager to learn from them for FIVDB's own productivity.
Data entry does not command much prestige. It is accepted as a necessity and feared as
a source of cost and of error. Except in rare circumstances - qualitative research projects
that combine coding with data entry come to mind -, the learning value of this survey
phase is low. Often the work is contracted out or assigned to minimally skilled workers
who otherwise have no stake in the research 1.
Under time pressure, or for lack of proper supervision, entry errors can be substantial. In
a survey quality perspective, these errors are a subtype of processing errors. They
appear on the left side of the survey lifecycle diagram in the classic book on survey
methodology by Groves et al. (2004: 48). It may be significant that the authors do not
even consider data entry a subject worthy of discussion. In practice, data entry
challenges may bedevil also the representational qualities of surveys (the right branch of
1
These and related complaints have been detailed for large-scale epidemiological studies, which also rely,
in considerable part, on questionnaire surveys. For example, Ali et al. (2006: 1) note that an "efficient
data management system is often not available", and "surprisingly little attention is paid to the
computerization of data".
7
the diagram); under time pressure, only part of the questionnaires may be processed,
causing the effective sample to fall short of targeted precision.
Barchard and Pace write: "Data entry errors can be disastrous. In many industries, data
entry errors are minimised by paying highly skilled data entry personnel or by using
expensive optical scanning technologies. However, researchers, small businesses, and
non-profit organisations cannot afford highly skilled data entry personnel or the
alternative technologies, and instead rely upon minimally trained staff and volunteers
for manual data entry. Obtaining accurate data entry is a challenge in these
circumstances." (2008: 359).
"Disaster" may sound overly alarmist; but their description of the issue outside well-
endowed research agencies is pertinent; it leads us closer to the world of NGO-led
surveys, with its reliance on manual data entry.
8
[Sidebar:] Types of entry errors
While data entry errors form a subgroup within processing errors (which some authors subsume
under measurement errors while others treat them separately), it is useful also to clarify
differences within entry errors. In their "A Taxonomy of Dirty Data", Kim et al. (2003: 84) list
several types, which we have arranged in this diagram.
Involving In multiple
single field fields
Incorrectly
Misspelling
derived
Extraneous
data
They give examples of instances in types that are not self-explanatory: An erroneous entry
happens if, e.g., age is mistyped as 26 instead of 25. Extraneous entries add correct, but
unwanted information, e.g. name and title in a name-only field. An entry of an incorrectly derived
value occurs when a function was incorrectly calculated for a derived field (we might think of a
data entry person who incorrectly transforms British into continental measures). Inconsistencies
across tables or files occur e.g. when the number of employees in the employee table and the
number of employees in the department table do not match.
These authors do not include so-called dangling data under entry errors, i.e. records in one
table (e.g. household members) for which at least one corresponding record in some other table
is required, but is missing (e.g. household). To the extent that the corresponding records were
all present in the source document, one is inclined to treat the loss as a data entry error.
We believe that another, rarely discussed important distinction concerns whether and how
errors can be recognized and corrected:
1. by pre-programmed validation rules (e.g. Sex = male and Pregnant = yes Error!)
2. by comparing with the source document or by double entry
3. by outlier and plausibility tests, with possibility to re-measure in the field
4. as in 3., but without possibility to re-measure.
5. without any basis to determine implausible values
9
In cases 3 - 5, computer entry and source document may agree. If so, these errors are not entry
errors.
Obviously, there can be other error typologies, based on other distinctions; there will also be
types of errors detected in data management practice that overlap several types.
These differences are not immutable. Technology has changed the survey environment.
Mobile phones and e-mail have made survey management easier, in terms of faster
coordination and document transmission. Technological advances have facilitated data
entry as well.
There are a number of computer applications meant to reduce the data entry burden as
well as entry errors. Among relief and public health personnel, for example, Epi Info is
popular (CDC 2005). At the high end of data management, the US Census Bureau has
made available a programmable application - CSPro (U.S. Census Bureau 2009). In
development NGOs, it is not uncommon to call on external consultants to create
dedicated applications using Microsoft Access, for data entry as well as for standard
reports. The statistical application SPSS enjoys widespread popularity, particularly
when survey operations are shared between NGO staff and consulting firms.
Some of these applications work together with modern technologies that capture data
electronically at the point of response. These - known by acronyms such as CAPI
(computer-assisted personal interviewing) 2 - have made headway in relief and
professional survey organizations. As mobile telephones become cheaper and more
2
This is a fast-moving field (See e.g., Cork, Cohen et al. 2003: probably obsolete by now), with
interesting connections also to mobile telephony (Tomlinson, Solomon et al. 2009) and GPS devices.
10
widespread (phenomenally so in Bangladesh), we will over the next years see more data
entry tools developed for these devices - for interviewers capturing the response in
programmable phones and transmitting it to a remote database, as well as for sample of
respondents who will be given such phones to communicate repeated measurements
(Couper 2005).
The adoption of these technologies by national NGOs in poorer countries has been slow.
FIVDB's surveys are using, and the rest of this paper assumes data entry off, traditional
paper-and-pencil questionnaires.
For the build-up and maintenance of survey competency in development NGOs, data
entry relying on EpiInfo, SPSS and Access-programmed special applications suffers
from two serious drawbacks. First, these applications are centralizing in the sense that
not only the design, but also the operation of the specific data entry modules remains in
the hand of few persons - those who can realistically be trained and supervised for the
purpose. They generate little or no skill transfer to other persons or for other tasks. They
are often not sustainable because the expertise for needed design changes is lost, too
slow or prohibitively expensive. Typical manifestations are abandoned database
applications 3 and socially isolated analysts in backoffices who produce one-time
statistical analyses for reports that will never be used in the field.
3
In the extreme, they are created by consultants who subsequently emigrate together with the password,
which they never shared with their principals, as we found in a NGO in Nepal.
4
Survey researchers have long recognized this danger. One of the remedies recommended consists of
"post-coding" unanticipated categorical and other qualitative information, the operation of assigning
codes to recorded (textual, verbal, pictorial, etc.) information after the interview and, ideally by persons
other than the interviewers, before or during computer entry. Some code-category pairs may be created
only then.
However, the skill requirements may be considerable (de Silva and Gunetilleke 2008). The logical
challenges of coding have prompted in-depth studies in other fields as well, notably in health care
informatics, particularly in processing information on (partly patient-described) disease symptoms, which
may defy clear diagnoses. The two approaches to coding in this field are called "pre-coordination" and
"post-coordination", but the rationale is the same as with pre-coding/post-coding: "Pre-coordination and
post-coordination can complement each other, with pre-coordination providing logic and intricacy and
post-coordination allowing expressivity and more complete domain coverage" (Rosenbloom, Miller et al.
2006: 280).
This is a thorny issue and one sufficiently important in our view to devote a sidebar to the topic of
recoding, which we consider a desirable skill needed in the transition zone between data entry and initial
analysis. See further below, page 41. The need arises regardless of what entry template one uses.
11
Data entry and empowerment
We offer elements of a more flexible data entry arrangement, with built-in controls that
help to reduce errors and at the same time speed up the entry work. We use an
application - Microsoft Excel - that is widely known by development NGO personnel.
We exploit features of Excel that allow field personnel to add or modify category sets
while making the coding easier. Our template, built with these elements, encourages
several desirable practices:
The opportunities for creative feedback often go unnoticed. NGOs do go out of their way to build
analysis skills in the groups they organize - in literacy, accounting, resource mapping and in a
proliferating variety of participatory research and reflection tools (Chambers 2008). In most
events, the information used is on one village community or small grassroots organization.
Comparative perspectives on groups that do not regularly meet are not prominent.
It takes relatively little investment in Excel skills training for area-based monitors to produce
simple tables that speak to a number of neighboring communities or program groups. Our
template encourages the practice of making local copies of data tables and to analyze them
long before the overall survey report is available. Appending tables from several villages and
extracting comparative statistics via Pivot table require skills that are easily teachable.
Monitoring associates have produced such tables, in the Bangla script and on a large sheet that
neo-literates could read, for a number of Community Learning Center meetings. For example,
they offered three posters with tabulations for discussion when they met with some 35 people of
Pirergaon village in Sylhet District in August 2010. Two of the posters spoke to the situation of
the village. The first table compared literacy levels by gender. The second summarized the
adoption of hygienic latrines by four wealth ranks. The ranks, from rich to ultra-poor, had been
assigned to households by the Center volunteers themselves, at the time of creating a village
map. Literacy and sanitation data were elicited during an FIVDB survey, which then also noted
wealth ranks.
12
The third table contrasted Pirergaon with its neighboring village of Foringura, as regards the
school enrollment of children age 5 - 12. This data had also been collected in the survey and
broken down by the wealth ranks of the children's parental households. In total, 134 of
Pirergaon's 196 children in that age bracket attended school, and 249 out of the 309 in
Foringura.
This manner of presenting information in tables and in comparison with some other village was
new for many, and time was spent to explain and clarify. The discussion was lively. Questions
ranged from curiosity ("Who is the child that you say is the one child from rich families in our
village that is not going to school?") to appeals for help to continue one's children's education, to
the expression of local pride ("Our situation is better than Foringura's in all respects, except in
education" [Never mind that on literacy and sanitation the meeting did not have comparative
information!]).
Requests for action were also advanced. FIVDB was requested to provide a print-out of those
fellow villagers who were not fully literate (for literacy training). But most speakers singled out
sanitation problems - the need to install hygienic latrines in all households and the inability of
the 14 ultra-poor households to pay for them from their own means. Again, a request for a
printed list was made, of those 14 households, for whose benefit the committee wanted to
collect money. The meeting ended with a request that the monitoring associates come back with
other comparative statistics that would let the villagers understand their situation in a new light.
13
Our impression is that the skills level of monitors rarely is the most limiting factor on the use of
survey data in field offices and among program participants. Role ambivalence - other field staff
treat monitors as data entry clerks and reporting assistants -, excessive interviewing and data
entry burdens as well as difficult coordination with other program lines are more deterring.
These conditions can be improved through supportive policies from the center. Chances will
then be better that local participatory forms - in the recruitment of program participants or in
grassroots planning events - benefit from computerized survey data. To the extent that data
entry takes less time, and information of local value can be produced faster, survey practices
and participatory culture can support each other in a virtuous cycle.
Between summer 2009 and spring 2010, FIVDB conducted a baseline survey of all
households in 98 villages of four districts in northeastern Bangladesh. Data was
collected and processed on over 19,000 households and over 110,000 residents. The
data was entered by eleven field office-based monitoring associates. Most of them had
undergone data management training in Excel in two workshops in 2009 totaling eleven
training days. It had been collected by program staff trained by the monitors and
supported by grassroots committees. The committees - so-called Community Learning
Centers - shared maps of dwellings that they had drawn during participatory reflection
14
events in their village communities. While the village maps were important for
collective learning, the baseline survey was able to access them for elements of the
unique household and person identifiers that will be needed in later follow-up surveys.
In retrospect, the baseline data entry effort is difficult to quantify, but it is safe to say
that it took several months of each monitor's working capacity.
The monitors entering the data had undergone more intensive data management training
than what workers in similar positions in other Bangladeshi NGOs are typically
afforded. Yet, there were two problems that impaired productivity and reliability:
• The template made extensive use of Excel's data validation feature. While
reducing errors, it also obliged the monitors to enter categorical data by multiple
mouse click-and-scroll operations for every active data cell. These were time-
consuming and tiring.
• The data sheets were not protected. For workflow convenience or out of local
analytical interest, several monitors took to rearranging the order of columns in
data sheets. This complicated the table appending operation while creating
master tables for the 98-village database. Some of the more subtle reordering
was not detected until the analysis stage; amending the tables and re-doing
analyses lost considerable time.
It was primarily the drive to mitigate those problems that led to a modified data entry
template, including features that the following sections present. The monitors requested
a template that would require less frequent switching between key strokes and mouse
operation. The organization wanted to minimize errors in data entry and in appending
tables for the master file.
In addition, the ability to locally add categories to the set of legal values seemed
desirable. Such a feature would avoid the dilemma between provoking error messages
or relegating novel categories to supplementary text fields. For example, when asked
about membership in organizations, interviewees reported some that are strictly local
and were unheard of in the FIVDB head office. Given their local importance, they
should appear in tabulations and, for this to happen, need to be in the set of
organizations, with codes prompting for them and with (at least locally) uniform
spellings.
The template was to find the right balance between central control and local autonomy
in other ways too. It was felt important to create a system of unique record identifiers.
These needed to respect centrally prescribed rules; they needed to be created in the field
offices before the master file of all survey records would be ready while incorporating
elements of village maps kept in the Community Learning Centers.
15
Organization of this study
We proceed as follows. We spell out the assumptions, particularly regarding staff skills.
We describe precautions taken against potential errors and for better productivity. We
describe the basic mechanics of the data entry workbooks. Since surveys may collect
data on more than one entity, we open a sideline to illustrate a two-level data structure.
For our household and household member data, we describe a system of unique
identifiers that facilitates linkage between tables and with follow-up surveys. We then
proceed to the more technical part.
This part is again sub-divided. We describe devices that our template uses without the
need for Visual Basic for Applications (VBA) code. Separately, we present sample
VBA code used in macros that monitor and speed up the data entry.
We then report on benefits as well as problems encountered during the use of the
template for the current baseline data entry round. We hope that readers may contact us
with improvements. We assume that, "out there" in the networked audience, there are
individuals and groups savvier than we are in survey support, and that we can learn
from them to improve this template or to altogether replace it with something better.
• A monitoring unit, attached to the NGO central office, designs most or all of the
survey tools. It trains the field staff in its direct line in their use. These workers
learn how to train and supervise interviewers (recruited among program staff
and grassroots volunteers) and how to manage rapport with village communities
and with others whose cooperation is beneficial (such as local government
officials).
• Head and field offices are equipped with computers in which a macro-enabled
version of Excel has been installed. The head office shares data entry templates
with the field offices; the field-based monitors have been trained in their use
(and ideally were part of the pre-test and revisions). File naming, sharing and
back-up protocols are in place.
• Field-based monitors possess a medium level of Excel skills. They know how to
link, append and merge tables, understand identifier concepts, and can make
16
two-way tables off data tables. They are motivated and able to do simple
analyses of the local data.
• Program and monitoring staff share a concern for reliable data. They expect to
feed back some of it to the grassroots organizations who helped in the collection,
for local targeting and planning purposes.
Experience with an earlier template that FIVDB used in 2009 recommended a number
of precautions. These were built into the next data collection and entry round in 2009 -
10. Some of them may be of general value:
• Field offices receive data entry templates. The monitors, who install them in
machines with MS Excel with macros enabled, have previously been trained in
their use as well as in basic descriptive data analysis.
• The templates have data entry sheets with protected areas. These are passworded
at the central office. Similarly, the macro code is protected.
• Macros supervise whether values entered into fields are legal. Illegal entries
trigger red cell background and a warning text. Both can be removed by the data
entry operator, whose corrections will not be visible to others.
Meanwhile, individual field workers or area offices are free to assemble copies of
village-wise data sets for their immediate local analysis needs. These assemblies will
keep the same record identifiers; calculated or additional variables can thus later be
merged.
5
E.g., "Grameen Bank" by "g", "5", "Grameen", "Gramin", or by whatever the laid-down set for this
value. Monitors may also add their own codes as long as they are unique in the category set, e.g. "gram".
17
[Sidebar:] Spreadsheets for data management - pros and cons
Our choice of a spreadsheet template for survey data management needs justification. In
professional research, the gold standard remains the relational database. Even in less
sophisticated survey environments, there are serious objections to this use of spreadsheets.
This bullet list of pros and cons is eclectic; readers may add more compelling points.
Objections
• Spreadsheets as such do not reinforce the culture of documenting data whereas
database applications offer data dictionary features. Knowledge of what the data means
is often limited to those currently working on a survey and dies at the end of the project.
• Spreadsheet applications do not provide data entry forms that closely imitate the layout
of questionnaire and data form pages. The differences in format make data entry more
strenuous and error-prone.
• Database applications have higher standards of data safety, both technically and in the
culture of the users. Spreadsheet data are vulnerable, particularly in multi-user projects.
• Complex multi-entity structures are difficult to model in spreadsheets. Even for relatively
simple multi-table databases, relational queries take extra skills, effort and devices.
Although this is a specific consideration in poorer countries, it is one more point in favor of
database applications: With the spread of Internet access also to remote field offices,
decentralized data entry can take advantage of the fact that Microsoft Access or OpenOffice
Base files can simply be attached to e-mail messages. Collaborative database use does not
depend on an integrated corporate network.
Points in favor
• Database applications depend to a greater degree on external expertise and support. In
less well-endowed survey environments, this often does not come forth. Collapse,
abandonment or invalid results are frequent consequences of survey data management
programs created with such tools.
• Spreadsheets are extremely popular; significant numbers of NGO workers, inside and
outside units conducting surveys, use them. Skill levels within units and organizations
vary widely; stronger users are available as informal trainers. Inter-departmental
coalitions can be built around managers and, below them, around spreadsheet users
working on a common task such as the entry, analysis and use of survey data.
Spreadsheet skills acquired during surveys are transferrable to other tasks.
• Some spreadsheet applications, including Excel, offer data validation features. Data
entry templates can incorporate them in order to force legal entries.
18
• First, the concept of "total survey quality management" will likely be absent or relegated
to something desirable, yet infeasible. In the culture of NGOs that are not chiefly
devoted to research, even the more restricted notion of "measurement error" (and its
absence in unbiased and reliable measurements) is outside the daily concerns of the
majority of workers.
Notions of "processing error" are stronger in units tasked with financial transactions and
contracts, such as Accounts, Micro-finance and Human Resources. The mental models to avoid
and correct them are borrowed from audits - adherence to manuals, balancing and visual
double-checking by other persons. These departments are used to complementing their
dedicated computer applications with spreadsheet applications of which they are the end-user
programmers.
This discipline is helpful, but not sufficient for quality survey data. By itself, the audit model does
not protect from practices that compromise survey data validity. For example, NGO monitors
often do not see the need to distinguish between missing data and genuine zeros, on the
rationale that sums and counts of positive values are not affected.
• Second, NGO staff working on a survey often have other tasks as well. They may be
dynamically assigned to this survey, other surveys and to non-survey work in response
to seasons, transport and electricity shortages, organizational priorities and staff
changes. Concept consistency (among workers) and concept permanence (over time)
can therefore be a challenge in data entry as well. Specifically, and in contrast to
professional research agencies, survey data use in development NGOs cannot always
wait for the final, "locked and frozen" version of the database. Part of the data may have
to be used on an urgent basis, for programming purposes. Additional record entry,
corrections, reformatting may go on for considerable time after a first partial analysis.
The co-existence of multiple versions of data tables creates its own dangers and
burdens.
Therefore, technical training of those collaborating in various survey phases does more for
quality than applications focusing on data entry by low-skilled workers do. Since spreadsheet
applications are socially more inclusive than database applications, we opt for the former, with
all due anxiety about the things that may still go wrong.
Demo file
The template being used in the current (summer 2010) round of the FIVDB Jonoshilon
survey accommodates 139 household-level and 31 individual-level variables. A segment
of the template has been made into a downloadable demonstration workbook (link). The
19
file name is "FIVDB_DataEntryTemplate_100820.xlsm", the password is "password"
(all lowercase). It will be needed to change protected cells and to access hidden columns
in the data sheets. The same password makes the macro code in the VBA editor (press
Alt + F11) accessible.
For this purpose, the household-level variables have been reduced to 40, and the
individual level ones to 23. Data is included from one village on 20 anonymized
households and on the 150 corresponding household members. As many as 14, resp. 15
of the variables, including household head and member names, serve identification
purposes. The demonstration workbook retains them all in order to show how the
unique record identifiers are formed. The substantive variables in the demonstration
workbook were selected chiefly so as to illustrate various types of macro code useful
when adapting the tool for other surveys.
We advise readers to have a first look at the workbook before reading on. Following the
description of the template's key elements below, some may want to add a few fictitious
records to the household and household member data sheets, perhaps even modifying
some category sets in the list sheets. Accessing the protected area may be instructive for
those closely looking at the macro code.
Key elements
In the basic mechanics of the template, four key elements work together. These are: the
sheet structure of the workbook, the protection of sensitive cells and areas, macros
attached to the data sheets, and flexible coding of categorical variables.
We describe each element in non-technical language. Subsequent sections will detail the
technicalities. Readers not concerned with this detail may then skip forward to the
sections commenting on our practical experience with the template.
Workbook set-up
For each surveyed village, monitors save a separate workbook based on the data entry
template. The limitation to one village is motivated by concern for the uniqueness of
household numbers (taken from the village map). Also it makes supervision easier; all
the data on a given village is usually entered by one or two persons.
• village-level data
• household-level data
• household-member level data.
For simplicity, the demonstration workbook operates at two levels, the household and
household members.
For each level, three worksheets and a private VBA code section are provided:
20
• The LevelX_Vars sheet holds metadata; it enumerates the positions (column
numbers), names and labels (= full names, meanings) of the variables that hold
the survey response at this level.
• The LevelX_Data sheet is the one in which the data will be entered. Copies of it
will be used to assemble the master file in the central office or to do local data
analyses in a field office.
• The LevelX_Lists sheet holds category sets for categorical variables (e.g.,
occupation) and the admissible prompts (codes, abbreviations, alternative
spellings) for each included category (e.g., typing "Bee" would be enough for
"Bee keeping").
• VBA code is attached to each LevelX_Data sheet.
Figure 4 exemplifies the structure of the workbook for the two-level data situation. The
arrow between the two data worksheets indicates that household member records are
linkable to the household records through the household identifier. The macros are
private to each data sheet; they are not in a workbook-level module.
Sheet protection
In the data sheets, the pre-defined variable names (column headers) cannot be changed
by field personnel. In addition, some columns of a technical nature remain invisible.
Field workers can, however, add variables of local interest in the unprotected area of a
data sheet.
21
Figure 6: Structure of a protected data sheet (household level)
Household
Serial No [Other [Substant - [Locally
number HHCode
RunNo
RecNo
= Hidden columns; record and running numbers assigned during master file assembly
= Protected field header cells
= Data entry area. Legal values supervised by background macro
= Locally defined variables. Not protected. Data not supervised
• Hidden columns are for operations by the central office. In the FIVDB baseline
survey, the central office assigns two record identifiers while appending all the
village-wise data tables into one master workbook: a running number within
each district, and the continuous record number for the full data set.
• The data entry operator in the field can access the red and green areas.
• In the red area, the field names (= the cells in the top row) are locked. They,
together with the rest of the table, can be copied.
• The column order cannot be changed. Rows cannot be inserted or deleted (the
contents of cells, though, can be removed). Data is entered starting in row 2.
Record-identifying variables are on the left side of the red area. Although the rules of
forming unique identifiers do not directly affect data entry (and vice versa), the template
can include elements that calculate and control identifiers. Record identifiers are the
subject of a sidebar beginning on page 26.
Variables of substantive interest are on the right of the identifier variables, followed by
authoring information (interviewer name, date of the interview).
The green area to the right is unlocked and unsupervised. It can be used to
accommodate additional variables, some of common use, others of a purely local
interest.
22
What the macros do
Each data sheet has an invisible macro attached to it. The macro kicks in whenever the
content of a supervised cell is changed. It does not need to be triggered by a user
command. It
• supervises whether cell values are legal in a given variable and flags errors
• checks whether household numbers are unique
• activates the next cell to the right when a legal value is entered
• depending on the response, skips several columns and activates the cell in the
next relevant variable
• activates, and scrolls to, the cell at the beginning of the next record when data
entry in the current record is complete, automatically creating a new serial
number.
The last two bullet points imply that the macro can mimic interview filters and do a
"carriage return to the next line".
6
Note that the records using the initial category definitions will not be updated when a set is changed
after the start of data entry in the concerned table.
23
cell in any of several fields dedicated to hold the multiple-response information.
Why the prefixes such as "04-"? By assigning numeric (or alphanumeric) prefixes, users
obtain a desired sort order when they later tabulate data. A table containing this variable
will thus return "Labor sale" in the first row or first column and "Other income" in the
last.
Assume now that a monitor receives several questionnaires noting "bee keeping" as a
significant source of income. He wishes to add this as a distinct option to the income
sources that the workbook will recognize. All he has to do is to decide on prompts
(codes, abbreviations) and on a prefix that assigns the option a particular place in the
sort order of future tabulations. The additional pairs of prompts and values are simply
written to the bottom of the set area, without any blank rows. In this example, "26",
"Bee", "Bees", "Bee keeping" as well as "26-Bee keeping" will all return the same string
- "26-Bee keeping".
We will now describe Excel features that were used as essential building blocks for this
data entry template. This section is technical. Readers not interested in this detail may
fast-forward to page 38. They may still benefit from browsing the demonstration
workbook and from practicing data entry in a few fictitious records.
At first we will describe some features that use Excel menu options and in-cell formulas.
These are more widely understood than VBA code. Subsequently we will detail the
macro structure and the detailed code.
Every cell range holding a category set is a named range. This named range is dynamic.
It expands and contracts automatically if and when elements are added or eliminated 7.
The dynamic ranges in the category set sheet use the functions OFFSET and COUNTA.
7
Dynamic ranges do not appear in the drop-down name list to the left of the function bar. They do appear
in the Name Manager (under Formulas in Excel 2007). We use the formula supplied by OzGrid (OzGrid
Undated).
24
For example, if the set has been placed with its left-column title in cell R1C44, then the
formula
=OFFSET(Hlists!R2C44,0,0,COUNTA(Hlists!C44) - 1,2)
will work as follows: COUNTA(Hlists!C44) returns the number of numeric and text
entries in column 44 of the sheet named "Hlists". Since the lookup table must not
include the title row, we subtract 1. This is the number of prompt-and-category pairs
included in the table. The function OFFSET with the arguments given here returns a
reference to a range. This range begins at R2C44 as it upper left cell [the two arguments
after R2C44 both are zero, meaning no offset from the start cell]. It is
[COUNTA(Hlists!C44) - 1] cells' high (downward from, but including R2C44), and two
cells wide (rightside from, but including R2C44).
In general terms, in order to create such a dynamic named range, template designers go
to "Formulas" "Defined Names" "Define Name" in Excel. Here they enter a name
for the range and type in the field "Refers to:", using a suitable instance of
All named ranges in the category set sheets are dynamic. If items are added, eliminated
or modified, VLOOKUP will find lookup tables already updated accordingly. Note that
cell entries made with earlier versions of the category set will not be updated; they
produced static values, not linked formulas.
Several users suggested including also the full spelling of every category in the prompt
column. In other words, in the earlier example of "26-Bee keeping" among income
sources, a pair "26-Bee keeping" [prompt] - "26-Bee keeping" [category] would be part
of the concerned list. The function of such fully spelled-out prompts is to prevent error
messages from actions that, we believe, are rarely taken. That is, an error is flagged
when the user returns to a correctly filled data cell, clicks the formula bar (edit mode),
does not make any changes, and yet clicks the Enter button (or hits the Enter key). To
satisfy this demand, we have included such prompt-category pairs for all categories in
our template lists. But it is not obvious why users will want to go into edit mode in a
cell that they will not change; thus template designers should feel free to adopt this
precaution - or not. If taken, it will indeed forestall an error in this type of action.
25
Figure 8: Excel sheet protection menu (under Cells - Format)
In the FIVDB baseline survey, unique household codes are formed by concatenating district,
sub-district (Upazila), commune (Union), village and household information. In general, it is
sufficient to concatenate the first four letters of the administrative unit names. However, in the
first survey round, using first-four-letter codes for villages led to errors. Neighboring villages in
the same Union might be called X-village-aaaa and X-village-bbb, for which mechanical
extraction produced duplicates. Not uncommonly, East-X-village and East-Y-Village types of
8
The AutoFilter option appears when scrolling down in "Allow all users .. to:".
26
village names would occur, also prone to confusion. Village codes thus have to be selected
manually, to make sure that each be unique.
For the household element, the concatenated identifier uses the dwelling number on the maps
that Community Learning Center volunteers had drawn of their villages. These numbers were
unique (outlines of buildings with several households were divided by lines). For uniformity,
each of the concatenated identifiers consists of four digits. In this anonymized village in
Brahmanbaria District, Nasirnagar Upazila and Kunda Union, sample household identifiers are
shown in the figure below. Note that serial and household numbers are not identical; dwelling no.
8 was not inhabited, or its residents were not interviewed.
The concatenation uses combined Excel functions. Given the above column numbers, the
household identifier is formed by:
=CONCATENATE(RC[-4],"-",RC[-3],"-",RC[-2],"-",RC[-1],"-
",TEXT(RC[-5],"0000"))
The household member table includes the household identifier, using the same formula. The
individual identifiers are formed by concatenating household identifier and serial number. This is
a convenience arrangement; there is no natural numbering of members in the households
interviewed. Some may prefer to use the serial number of the demographic sheet in the
questionnaire, but it is unlikely that this paper document will be at hand in future references.
Figure 9 shows the identifiers of a few members of the first two households.
27
The household member identifiers, given column numbers, are produced by
=CONCATENATE(RC[-1],"-",TEXT(RC[-7],"0000"))
In both formulas, the function TEXT with the optional argument "0000" forces the numeric
elements to be included with four digits. This is necessary for proper sorting.
The unique identifiers are basic preconditions for linking household and person tables. Survey
analysts may want to attach a household variable (e.g. wealth rank) to household member
records. In the opposite direction, they may want to attach a function of the household member
set to the household table (e.g., the age of the household head). In Excel, this is easily done
with the help of the function VLOOKUP and Pivot tables.
The mechanics of identifier production are simple. The template provides the formulas in the
first record. Data entry persons copy them downwards. There are, however, motivational
challenges. It is essential to explain the function of these constructs to data entry and analysis
personnel, both from the head office and field-based. Our training offered intensive practice on
how VLOOKUP and Pivot tables serve to link tables. Care for arcane requirements like
composite identifiers can be expected of workers who are encouraged to use the data in their
local work environments. In other words, workers who feel they have an intrinsic stake in the
data, and who understand that the tedium of data entry ultimately serves an intelligible and
personal aim.
At first glance, this problem seems to have nothing to do with data entry as such.
However, the formatting of multiple-choice data that is the most convenient for the
analyst may not be the most efficient for the entry operator, and vice versa. Some
statistical programs have created special procedures (e.g. for STATA, Jann 2005),
reason for us to look at the problem more closely in a sidebar. For data entry in Excel,
we favor a particular approach. In a second step, at the time of analysis, the original
variables can be transformed into a format more amenable to tabulation and modeling.
9
We found that not all field monitors would immediately grasp the difference between "the percentage of
households that have a loan from the Grameen Bank" and "the percentage of current loans that are from
Grameen".
28
template may provide each of them with his own field. Technically, for each provider a binary (or,
if you like, indicator) variable (yes/no, 1 or 0, and the like) is created; at data entry each is filled
with the positive or negative value (unless there was no response to the question, in which case
they are left blank). This representation is known as the indicator mode (Jann, op.cit., p.93).
If the choices are numerous, or if some of interest are not known in advance, it may be easier
on the data entry persons as well as analytically beneficial to define multiple generic variables
like "Provider1", "Provider2", etc. For this solution, adaptable category sets are part of the
technical answer. At data entry, monitors add providers not yet included in the default list. The
macros then ensure that the correctly spelt provider name is inserted into the field. Entry
continues for as many providers as were reported by the household. The additional generic
fields remain blank. This is called the polytomous mode, meaning: with more than two choices.
Polytomous multiple-choice formats are analytically cumbersome. Their translation into a set of
indicator variables is easily accomplished with a formula that combines the functions IF, MATCH,
and ISERROR, and makes use of mixed cell references. Given that the polytomous data was
entered in columns 19 - 23, it is extracted into indicators by
=IF(ISERROR(MATCH(R1C,RC19:RC23,0)),0,1)
as in this screenshot. The formula is identical for all indicators of loan providers.
29
Figure 13: Transforming polytomous variables into indicators
For smaller microfinance providers that are rarely found, specific indicator variables may not be
useful. Their presence in the survey households can be lumped into a residual count variable,
such as MF_other in column 92 above. The variable is calculated as the difference between the
polytomous entries and the sum of the indicator variables, i.e.:
= COUNTA(RC19:RC23) - SUM(RC88:RC91).
Later analysis may then proceed through Pivot tables or through named ranges used in
functions. In the example, let us assume that we calculate the number of loan providers to
10
whom a household is indebted as "NoProviders", and thus name the data vector as well .
Assume that HHcode, the household code, is also the name of the range holding all the codes
in the table, and that the maximum number of current loans from the same provider to a
household is one.
= SUM(NoProviders) / COUNTA(HHcode).
Once monitors understand the polytomous and indicator modes, both entry and analysis of
multiple choice data can proceed efficiently, and with few additional devices needed. It is
obvious that this approach has its weaknesses. Notably, the control of missing values is not well
handled.
30
the other to the household member sheet. The macros are tailored to respond to changes
in the values individual cells (not to those affecting a range of cells). They both follow
the course of action abbreviated in the flow diagram below.
31
The detailed code that executes action in this flow diagram is given in the appendix.
Blocks of code lines that detail the response to particular subsets of variables are
contained between lines and start with appropriate comments. This should enable
readers with basic VBA knowledge to adapt the macros intuitively to the needs of their
particular data entry templates.
Obviously, the question of more general interest is how the macros recognize types and
subtypes of variables and supervises cell entries appropriately to each.
Before writing any macro code, the template designer determines what treatment each
variable should receive. Treatments must determine the response to these questions:
Some variables call for the same treatment. For example, the fields that hold loan
providers (such as our NGO1, NGO2) justify uniform treatment. Other variables
demand a special treatment each for itself alone. For example, the serial number at the
beginning of the visible record must be consecutive and unique integers. For the
household numbers, we expect unique integers, but gaps are allowed (dwellings without
an interviewed household).
At this point, we will only exemplify each of various ways of letting the macro know
whether a variable is to be treated individually or as a member of a group of variables
all calling for the same treatment. We give brief examples and hyperlink them to the
beginning of the relevant code block in the appendix.
As the event macro executes, it compares the name of the variable in which the change
occurred to a sequence of instructions. Code for specific variables, or for groups of
variables all to be treated the same way, is written in blocks separated by dash lines. If
the variable belongs in a particular block, it executes the instructions and (after some
internal house-keeping) terminates. If the name is not found yet, it continues evaluating
membership block by block. If the variable belongs nowhere, the macro terminates
without visible action. This is the case for some standard variables (e.g., the four-letter
codes for administrative units) as well as for any local variables created outside the
protected data sheet area.
In event macros, the range being changed is known as the "target". This holds also for
single-cell changes. The macro acquires the name of the concerned variable through the
expression "Cells(1, Target.Column).FormulaR1C1". It is used numerous times; for
brevity, a convention is made before the first use, defining
t = Cells(1, Target.Column).FormulaR1C1.
32
Henceforward, the macro "understands" that t refers to the target variable name.
If t = "SerialNo" Then
[execute treatment]
[move to end of macro code]
Else
[do nothing]
End If
[move to next block]
The code block for this variable can be inspected through this link.
33
The code block for this variable can be inspected through this link. The intersect method
has the advantage that variables can be added to or removed from the group by simply
changing a name definition.
The code block for this variable can be inspected through this link. The InStr function is
useful if a text string, or several text strings, are used in multiple variable names.
However, the target cell cannot be updated with a self-referential formula such as
Target.FormulaR1C1 = _
"=vlookup(Target.FormulaR1C1, [lookup table], 2, false)"
This produces an error. Instead, the result first has to be stored in a named cell. In earlier
versions of the macro code, we designated a cell in the data sheet itself to be the transfer
point. However, this solution was cumbersome and even error-prone. If on the right side
of the protected variable names, it could interfere with the addition of local variables. If
in a hidden column on the left, it would confuse central office staff when they unprotect
the sheet to append it to a master file.
34
One solution is to add a temporary worksheet that does the transfer job and is deleted
just before the macro ends. The data entry person never sees it. Thus, in the opening
sections of the macro, we find these lines:
Later, when the code finds the variable in point, it calls on the TransferCell:
Application.CutCopyMode = False
'Removes the blinking border from the copied cell
[Etc.]
Just before the macro ends, the name of the transfer cell and its sheet are deleted:
Application.DisplayAlerts = False
'To avoid a prompt when deleting the temporary sheet
ActiveWorkbook.Names("TransferCell").Delete
Sheets("TempForTransit").Delete
[some housekeeping instructions]
End Sub
We are convinced that there must be more elegant solutions to this problem, and are
grateful for suggestions.
Skipping variables
The command
activates the next variable in the current record, one column to the right. This is the
default step in row-wise data entry.
Depending on the response to a question, one or several of the next variables in the data
sheet may remain blank. In the interview, the corresponding questions were skipped; in
35
the data sheet, a cell several columns to the right is activated. For example, records of
households without any current loans will cause eight variables to be skipped:
However, data entry templates may undergo changes, some of which may affect the
composition and order of the variables. In this case, a set number for the variables to
skip (as in "Target.Column + 9") may lead to errors. It is safer to define the column
number of the next cell to activate in a different way.
By way of example, the monitors entering data of this baseline survey are allowed to
copy geographical information from the first records since it stays the same in a one-
village file. Therefore, once the household number has been accepted, the cursor jumps
directly to the name of the household head as the next variable. Since, for whatever
reasons, the number of fields between household number and name of the household
head could change, a different approach is chosen:
1. We name the cell that holds the variable name. In the example we name cell
R1C14 as "HHheadColHead" (admittedly not very elegant as a name!).
2. In the macro code, before any variables are treated, we refer to the column
number of this variable by
HhhColHead = Range("HHheadColHead").Column
3. In the code block for the household number, the "activate next variable"
instruction is
Cells(Target.Row, HhhColHead).Activate
However, it is desirable to be able to mimic the skipping and branching behaviors of the
questionnaire. A workaround is needed. For categorical variables, a one-letter prompt
can be defined in the category set in point and can be entered in the cell from which the
operator wishes to skip to the cell in some other column to the right. The macro can then
be refined to scroll to the location when this prompt is entered. However, the column
number needs to be passed to it through an appropriate range definition.
We demonstrate this with the variable set "Income sources". If no further sources are to
be entered, the cell in first column of the next relevant block is to be activated and
scrolled to. In this template, this is the variable "LoanHistory".
We add the prompt "z" to the list of IncomeSource codes in the sheet "Hlists", leaving
the corresponding value cell empty.
36
To make the column number immune to changes in the template, we define its column
header cell as the named range "LoanHistoryColHead" and put this convention among
the local range names near the beginning of the macro code:
LHistColHead = Range("LoanHistoryColHead").Column
Entering "z" in the income source data cell does not cause an error. But it must not just
move the active cell by one column to the right. This ElseIf clause takes care of the
"skipping" (in the interviewer behavior sense) and scrolling:
The full code block for this demonstration can be inspected through this link. This
skipping by entering a one-letter code has not yet been practiced in the field 12. Skipping
structures for other variable sets have not yet been added to the macros, with the
exception of the last field in the household member table, "Comment".
The restriction to one village is motivated by the control of household numbers, which
must be unique within a village.
However, not every one may want to follow this arrangement. To allow data from
several villages into one workbook, the uniqueness check for household numbers (not
for serial numbers!) can be switched off:
In the declaration section of the macro attached to the household-level data sheet, the
declaration section includes:
12
And will be seen by some as an inelegant solution. Suggestions for code performing these actions upon
simply hitting the Enter key are very welcome.
37
'Not checking uniqueness of household numbers
'opens the door to indexing errors.
The code concerning the household number at first checks whether a number has been
entered. If so, then this subordinate if-clause is executed:
We reiterate that multi-village data entry workbooks pose added risks. Without control
of the uniqueness of household numbers within a village, duplicate household codes
may ensue, and eventually errors in table linking. Also, if monitors copy geographical
information downwards without proper attention to the point where entry of a different
village begins, all floodgates for serious errors are open.
These precautions must not instill fear. They address real problems in the work of data
entry and analysis personnel. The code examples discussed above, however, are of
concern mainly to those who wish to adapt macros for new data entry templates. Once
these work, the macros are like invisible hands, quietly guiding and reducing anxiety.
In addition, FIVDB, as part of a plan to expand computer skills and facilities among the
Jonoshilon Community Learning Centers (CLCs), contracted some of the data entry out
to a computer center that seven CLCs had been jointly managing since 2007. The center
has so far processed about 6,000 household questionnaires. In fact, the FIVDB
monitoring coordinator considers this template the precondition for the successful
outsourcing of most of the data entry. We will report on this experiment in the sidebar
starting on page 42. Here we concentrate on the performance of the new entry template
as it has been used inside FIVDB.
Speed
The monitoring associates agree broadly that it takes about 30 percent less time to
dispose of a household survey than it used to under the previous, drop-down list-based
template.
38
But there are aspects of the new template that the monitoring associates resent, and
which cost them time. The built-in sheet protection forbids cell comments; these had
been popular for a variety of purposes. Cell comments served as "to do" items for values
that required call-back with interviewers, and as justifications of outliers that otherwise
would provoke queries. Comments now have to be written into an unprotected field to
the extreme right of the tables, necessitating long moves from and back to the entry cell
last opened. Similarly, the undo button no longer works. When an associate makes an
entry error, the cell content has to be deleted, and the red cell background flagging the
error manually reset to "no fill". This requires alternating between key and mouse
movements.
Some users would have preferred to first enter the names of all members in a given
household rather than finishing the record for one member and then proceeding to the
next member record. They felt that the entry, vertically in the columns, of names, age,
and occupations would save considerable time. The template does not offer the option to
offset the next active cell vertically for certain variables.
There appears to have been limited use made of the option to add categories or to create
additional prompts for existing ones, in multinomial variables. Additional categories
with one-letter prompts were created for NGOs that were listed locally, and had not yet
been included in the pre-installed category list.
Reliability
Data that the monitoring associates in the field offices themselves enter reaches the head
office as digital files, some sent by e-mail. The questionnaires remain in the field
offices. The PPR Unit coordinator and research associates perform a number of
consistency checks.
Inconsistencies - e.g., between "student" as occupational status and age outside the
typical range for students - are rarely found. Even assuming that part of the entry errors
are not caught, they appear much rarer than errors committed during the interviews.
This belief is based on the number of errors that the associates detect during data entry.
Nevertheless, some users requested built-in validation rules, such as a minimum age for
those with a known occupation.
When we contract out data entry work, we use more extensive checks. For an
illustration, see the sidebar below, especially on page 43.
39
Appended data sets do not by themselves make a master table. Record IDs have to be
created in the leftmost column. The categories and values used in the variables have to
be inspected for missing values, and whether they are consistent and within reasonable
range.
These requirements have not changed with the new template. However, since the field
offices enjoy the freedom to add categories locally, the final category sets have to be
reviewed all the more carefully. Some of the variables may not be usable in their initial
raw form, and new variables may have to be recoded from them. For example, age is
easier accessible through distinct age groups. Indeed, one of the key competencies in
survey data management, closely following data entry, is the ability to recode variables.
We devote the next sidebar to this operation.
Shortcomings
How well field and head office workers, collaborating through this template, succeed in
producing master tables of the survey data cannot yet be finally assessed.
One of the ambitions using this template is to encourage the early use of data in local
field offices, long before the head office distributes copies of master tables that combine
data from several offices. Field-based associates know how to use the workhorse of
descriptive statistics in Excel - Pivot tables -, but they find it inconvenient that they have
to make separate copies of the data tables in order to create Pivot tables. These cannot
be made off the protected sheets.
When the central unit appends the village-wise tables into the full-survey master tables, care
must be taken to evaluate the combined category sets, to absorb excessive, incoherent or rarely
used categories into broader ones, and to be clear about the rationale for the final number and
content of distinct categories.
In many NGOs, the central unit does not give its field monitors any discretion to adapt response
categories although they can be meaningful both locally and organization-wide. The reason for
fixing sets in advance is the lack of data management skills, specifically the inability to gain an
overview of all categories created in the primary data entry and to prune back excessive and
incoherent ones.
40
In Excel, a full listing of all variants used in a variable is quickly established, together with their
frequencies, by way of a Pivot table. The variant list can be copied into a fresh sheet, to create a
table of variants and their desired replacements. Conceptually, this is equivalent to the
operation of recoding a categorical variable known from statistical programs. Technically, an
Excel spreadsheet variable is best recoded into a new variable using the function VLOOKUP.
One of the charms of Excel is that it is easy to try out different recoding schemes before settling
on a final one. We will presently demonstrate this for a numeric variable.
In low-skill monitoring units, dealing with numeric variables is often even more inept. For
example, interviewers are instructed to record household income by ranges, e.g. $ 0 - 10,000,
$ 10,001 - 20,000, etc. This creates a categorical variable where the continuous variable would
be much more informative. Often the ranges fixed in advance have no theoretical or practical
grounding and lead to invalid conclusions while obscuring the really important aspects of the
distribution. This approach is nevertheless chosen because the monitoring unit does not know
who to handle continuous data.
Age is another example. Rounded up or down to full years, it is a discrete variable, but with so
many values that a treatment by age groups will be necessary for certain analyses. How these
age range categories are to be formed depends on analytic purposes, and should not be fixed
beforehand. Again, the function VLOOKUP, properly used, allows us to be flexible. One can
have several recoding schemes side by side.
The screenshot shows a range named "vlookage" offering three different recoding scheme for
13
age. For example, a 6-year old child would be placed in the age group 0 - 9 years in scheme #
13
The notation "00 - 09" ensures the proper sort order in Pivot tables.
41
1 and 2, and in age group 5 - 9 years in scheme # 3. In the household member table, a new
variable "age group" can thus be calculated, for the chosen scheme, as
Similarly, categorical variables that require an exact match with their recodings, are easily
recoded with the help of VLOOKUP(.., .., .., FALSE). The data management training of the
FIVDB monitoring staff devoted a full day to recoding techniques in Excel. The formula syntax, it
turned out, posed a minor challenge in comparison with the initial conceptual difficulties.
The center is located some 30 km north of Sylhet city, and about 40 km from the FIVDB head
office, close enough to benefit from urban resources, and far enough for an FIVDB support
person to devote half a working day for any service call.
Founded in March 2007 with help from UNESCO, the center was financially supported by
FIVDB until June 2008 and was then told to become self-supporting. The participating CLCs
kept it alive through subscriptions that paid for rent and electricity. Practically, the center found
little commercial activity - most of it in leaflets and invitation cards -, and a good part of its
equipment broke down, three of its five computers and, for some time, the internet access. It
badly needed other income.
Mixed motivations
A medley of motivations prompted FIVDB to entrust part of the baseline survey data entry to
such a little-proven outfit. Financial viability and the idea of involving the CLCs in the
management of the data that concern their communities were two of them. Other anticipated
benefits included relief for the monitoring associates, who, in theory at least, could then spend
more time in the field, and the need to find a viable model for the ICTs. Under the Jonoshilon
project, FIVDB has undertaken to open, by 2013, as many as 25 IT resource centers in the
project villages and to expose close to 8,000 CLC members and Class V primary school
students to computers.
42
Figure 16: Data entry in a computer center in the villages
Understandably, the cooperation with Khagamura has led to differing emphases in the opinions
of FIVDB staff:
• Technical staff in the head office urge a reality check in terms of the effort it takes to
maintain equipment and to chaperon the social organization supporting it; already
FIVDB has had to reorganize the Khagamura committee when it became obvious that
only two out of the seven CLCs had participated. In fact, in early 2011, the four most
active volunteers moved the equipment to a room rented from a house owner in their
extended family, thus essentially making this ICT a private concern.
• Finally, the monitoring associates in the field offices pointed out that shifting the data
entry away from them deprived the survey of its field editing function: the only way to
43
perform anything close to field-editing is to check questionnaires during data entry in the
very offices to which the interviewers (who, in their normal duties, are frontline workers
of FIVDB line departments) report daily, and to tackle anomalies directly with the
concerned ones. CLC-mediated (or, for that matter, any other outside) data entry
arrangements could not routinely access the interviewers. This point was compelling; as
we have seen, an additional loop was added to the process, by sending data files back
to field offices for checking.
A learning process
Outside observers would probably make three points:
• First, innovation is ongoing; the pull of opportunity seizes even something as mundane
as survey data entry.
• Second, as often in NGOs, objectives become hybrid. The original function of data entry
has been vastly stretched to accommodate financial, organizational and training
considerations that are extraneous to the logic of surveys.
• Third, work processes of the same kind become diverse, with part of the data handled
under one organizational arrangement, and another part under a different arrangement.
This is likely to reinforce the tendency to produce data with strongly correlated errors.
These occur because different instructors of the survey personnel give different
guidance (Bassi and Fabbris 1997) - say, one set of instructions for monitoring
associates vs. the re-interpretations that semi-autonomous contractors make of them.
On the other hand, a diversity of a work processes has the advantage to elaborate
organizational alternatives. The Khagamura center workers have learned to use the new
template, after only minimal training on the job. FIVDB considers this an investment into a
learning process in grassroots organizations. At the same time, FIVDB itself benefits from
outsourcing data entry. Its monitors agree that the template, with its error detection facility and
efficient prompts, is the precondition for this arrangement to work.
Outlook
We offer a template for survey data entry that may be adaptable to the needs of a range
of development NGOs. It may attract organizations that have found that their own ad-
hoc arrangements for data management do not suffice, and that they need tools that
make the entry process more efficient. Within this group, the template is suitable for
those which use the spreadsheet program Excel, rely on trained or trainable users to
enter data, and have access to the (modest) macro code writing skills needed to adapt it
to the tables and variables into which their survey data will flow.
Those are basic requirements. One of our higher purposes is to encourage decentralized
data entry, particularly with a view to rapidly plowing back data to grassroots
organizations that helped collect them, and to promote and facilitate analysis and usage
in and around field offices. FIVDB has invested in the training of its monitoring
associates to a point where most of them know, not only how to enter data, but also how
to produce simple descriptive statistics. This multiplies possibilities of supporting field
units and communities with useful information. With no extra effort, some of it can be
44
cast in a comparative mode that goes beyond the participatory analysis forms contained
within one small community, yet remains accessible and stimulating to deliberating
citizens.
Decentralized data entry is not without its challenges, and it should not be romanticized.
Professional supervision may be more difficult to extend. There are costs. FIVDB has
had to install generators in field offices to ensure adequate computer access for survey
data entry as well as for other uses. We assume, but do not have proof, that data entered
in field offices is somewhat more reliable than under centralized arrangements, chiefly
because the monitors benefit from almost daily interaction with the interviewers.
This serves as a second-best to the kind of formal field-editing that we would see in a
professional survey setup, but it does eliminate a number of errors, to a degree that
centralized operations achieve with difficulty. This is in line with what others have
found. In a randomized experiment with decentralized data entry in Vietnam, Glewwe
and Dang (2008) concluded that decentralized entry reduced errors and inconsistencies.
It saved time, particularly in data checking.
Realistically, this template may serve a mid-range group of NGOs engaged in surveys.
"Mid-range" applies both to current human resources and to the dissemination of new
technology. It is not helpful for groups that do very small surveys, or who cannot find
the skills to adapt the macro code. It will not be attractive to those with well-endowed
research departments nor to those who contract out most survey work. Moreover, the
advances that are made in mobile phone-based data entry, at present most resolutely
among relief agencies, will ultimately be embraced by development NGOs as well.
Technology will make paper-and-pencil interviewing and subsequent data entry
obsolete.
We are not yet there. The template, as developed and tested in FIVDB, may find a
congenial audience in NGOs that still use paper and pencil, are concerned about data
reliability, and appreciate spreadsheet technology as a platform on which workers of
diverse professions, skill levels and departmental functions can meet, learn and
collaborate. In fact, as we have happily observed since the first round of the Jonoshilon
baseline survey, the survey process itself changes role definitions. Monitoring associates
who used to be looked down upon as lowly clerical workers are actively consulted for
data management problems beyond surveys. Other departments have started pinning
hopes on the monitoring unit's savvy and discipline to help them move towards data
collections that might answer questions of outcome and impact. Villagers meeting in
Community Learning Centers are intrigued by comparisons to neighboring communities.
It would be an exaggeration to claim that data entry catalyzes any of those processes,
but it is becoming less of an obstacle to them.
45
References
Ali, M., J. Park, et al. (2006). "Organizational aspects and implementation of data
systems in large-scale epidemiological studies in less developed countries."
BMC Public Health 6: Article 86.
Barchard, K. and L. Pace (2008). "Meeting the challenge of high quality data entry: a
free double-entry system." International Journal of Services and Standards 4(4):
359-376.
Bassi, F. and L. Fabbris (1997). Estimators of Nonsampling Errors in Interview –
Reinterview Supervised Surveys with Interpenetrated Assignments Survey
Measurement and Process Quality L. Lyberg et. al. New York, John Wiley and
Sons, Inc.: 733-751.
Burnett, M., C. Cook, et al. (2002). End-user software engineering with assertions
[Technical Report 02-60-05]. Portland, Oregon State University: 103.
CDC (2005). Epi Info™ Community Health Assessment Tutorial [version 2.0]. Atlanta,
Centers for Disease Control, Epidemiology Program Office.
Chambers, R. (2008). Revolutions in development inquiry. London and Sterling, VA,
Earthscan.
Cork, D., M. Cohen, et al. (2003). Survey automation: report and workshop proceedings.
Washington DC, National Academies Press.
Couper, M. (2005). "Technology trends in survey data collection." Social Science
Computer Review 23(4): 486-501.
de Silva, N. and N. Gunetilleke (2008). "On trying to be Q-Squared: Merging methods
for a technical minded client." International Journal of Multiple Research
Approaches 2(2): 252-265.
FIVDB (2008). Jonoshilon. Popular Education. An Education Program Proposal
[September 2008 – August 2013. Submitted to: Canadian International
Development Agency and Embassy of the Kingdom of the Netherlands (EKN)].
Dhaka and Sylhet, Friends In Village Development Bangladesh.
Glewwe, P. and H. Dang (2008). "The impact of decentralized data entry on the quality
of household survey data in developing countries: Evidence from a randomized
experiment in Vietnam." The World Bank Economic Review 22(1): 165.
Groves, R. M., F. J. Fowler, et al. (2004). Survey methodology. Hoboken, NJ, J. Wiley.
Jann, B. (2005). "Tabulation of multiple responses." Stata Journal 5(1): 92-122.
Kim, W., B. Choi, et al. (2003). "A taxonomy of dirty data." Data Mining and
Knowledge Discovery 7(1): 81-99.
McRitchie, D. (2008). "Event Macros, Worksheet Events and Workbook Events."
Retrieved 1 March 2010, from
http://www.mvps.org/dmcritchie/excel/event.htm#ws_activate.
46
Nicholls, W., R. Baker, et al. (1997). The effect of new data collection technologies on
survey data quality. Survey measurement and process quality. L. Lyberg and e.
al. New York, John Wiley and Sons: 221-248.
OzGrid. (Undated). "Dynamic Named Ranges." Retrieved 4 July 2010, from
http://www.ozgrid.com/Excel/DynamicRanges.htm.
Pearson, C. (2010). "Events And Event Procedures In VBA." Retrieved 5 August 2010,
from http://www.cpearson.com/excel/Events.aspx.
Rosenbloom, S., R. Miller, et al. (2006). "Interface terminologies: facilitating direct
entry of clinical data into electronic health record systems." Journal of the
American Medical Informatics Association 13(3): 277-288.
Sana, M. and A. A. Weinreb (2008). "Insiders, Outsiders, and the Editing of
Inconsistent Survey Data." Sociological Methods & Research 36(4): 515-541.
Tomlinson, M., W. Solomon, et al. (2009). "The use of mobile phones as a data
collection tool: A report from a household survey in South Africa." BMC
Medical Informatics and Decision Making 9(1): 51.
U.S. Census Bureau (2009). CSPro. User’s Guide, Version 4.0. Washington DC, U.S.
Census Bureau, Population Division, International Programs Center.
47
Sample macro code
This code is attached to the worksheet "HHlevelData". The code attached to the second data sheet in the template workbook,
"HHmembersData", is structurally similar and is not printed here.
'Also moves active cell to the right if data entry correct and down to the next record when done
'-----------------------------------------------------------------------------------------------
'If true, HHlevelData table meant to hold data only for one village. This is the default.
'If false, uniqueness of household numbers not enforced.
'Not checking uniqueness of household numbers opens the door to indexing errors.
OneVillageOnly = True
'-----------------------------------------------------------------------------------------------
48
'Do nothing if a field name in the top row is modified:
If Target.Row = 1 Then Exit Sub
'-----------------------------------------------------------------------------------------------
'Field name (=formula in top row) in column where a cell value is being changed
'Its function is to refer to fields ([here:] = variables)
'not by column numbers (which may change),but by variable names:
t = Cells(1, Target.Column).FormulaR1C1
'-----------------------------------------------------------------------------------------------
'Delays updating screen until reinstated before exit or at the end of the macro
Application.ScreenUpdating = False
'-----------------------------------------------------------------------------------------------
'Name a transit cell in it, which is later needed to pass lookup values:
ActiveWorkbook.Names.Add Name:="TransferCell", RefersToR1C1:="=TempForTransit!R1C1"
49
'probably because the procedure is private
Worksheets("HHlevelData").Activate
'-----------------------------------------------------------------------------------------------
'ERROR PREVENTION:
On Error Resume Next 'Stop any possible runtime errors and halting code
Application.EnableEvents = False 'Prevents endless loops due to multiple action in the target
'-----------------------------------------------------------------------------------------------
'START PROCESS BY ENTERING SERIAL NUMBER'
If t = "SerialNo" Then
d = 0
With Range(Cells(2, Target.Column), Cells(Target.Row - 1, Target.Column))
Set c = .Find(Target.Value, LookIn:=xlValues)
If Not c Is Nothing Then
firstAddress = c.Address
Do
d = d + 1
Set c = .FindNext(c)
Loop While Not c Is Nothing And c.Address <> firstAddress
End If
50
End With
End If
End If
Else
End If
'-----------------------------------------------------------------------------------------------
'ENTERING HOUSEHOLD NUMBER
'Make sure field name for household number is spelt correctly, as Hhno, not HHno:
If t = "Hhno" Then
51
Target.Interior.Color = 255
Cells(Target.Row, Target.Column).Activate
Else
'If indeed it is numeric, then, depending on restriction to one village:
If OneVillageOnly = True Then
'Entry error signal if same serial number as in cells above again used:
d = 0
With Range(Cells(2, Target.Column), Cells(Target.Row - 1, Target.Column))
Set c = .Find(Target.Value, LookIn:=xlValues)
If Not c Is Nothing Then
firstAddress = c.Address
Do
d = d + 1
Set c = .FindNext(c)
Loop While Not c Is Nothing And c.Address <> firstAddress
End If
End With
'Scroll to active cell, but make sure previous record remain visible:
Application.Goto Reference:=ActiveCell.Offset(RowOffset:=-1), scroll:=True
Cells(ActiveCell.Row + 1, ActiveCell.Column).Activate
52
End If
Else
'If not restricted to one village:
Cells(Target.Row, HhhColHead).Activate
'The danger is in copying the geographical information downwards, with potentially
'incorrect village information. This process, however, remains manual.
End If
End If
Else
End If
'-----------------------------------------------------------------------------------------------
'When entering DISTRICT information:
If t = "District" Then
'Transferring the prompt to the vlookup formula, which then returns the full district name:
Target.FormulaR1C1 = "=vlookup(TransferCell, District, 2, false)"
Target.Copy 'These two lines replace the formula with its value
Target.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks:=False, Transpose:=False
53
Application.CutCopyMode = False 'Removes the blinking border from the copied cell
End If
GoTo Line1000
Else
End If
'-----------------------------------------------------------------------------------------------
'When entering UPAZILA information:
If t = "Upazila" Then
54
If Target.FormulaR1C1 = "#N/A" Then
Target.FormulaR1C1 = "Entry error"
Target.Interior.Color = 255
Cells(TargetRow, TargetCol).Activate
End If
GoTo Line1000
Else
End If
'-----------------------------------------------------------------------------------------------
'When entering UNION information:
If t = "Union" Then
55
Cells(TargetRow, TargetCol).Activate
End If
GoTo Line1000
Else
End If
'-----------------------------------------------------------------------------------------------
'When entering VILLAGE information:
If t = "Village" Then
56
TargetRow = Target.Row
TargetCol = Target.Column
Cells(TargetRow, TargetCol + 1).Activate
End If
GoTo Line1000
Else
End If
'-----------------------------------------------------------------------------------------------
'For the rest of this data entry sheet, several variables are grouped together,
'as far as the macro is concerned. The variables in a block are treated the same,
'to abbreviate the code. If the macro finds that the target cell is in such a variable,
'it executes the block instructions (e.g., if it is a numeric variable: requiring numeric input),
'and then does some house-keeping and exits the macro.
'Else it moves to the next block.
'-----------------------------------------------------------------------------------------------
'-----------------------------------------------------------------------------------------------
'When entering INCOME SOURCE information:
'Example of skipping if no data to enter and scolling to next block [here: Loan History]
'Type "z" if no information to be entered and you wish to skip to
'The vlookup function will then return zero. The ElseIf-clause below removes it.
'It activates the cell in the column skipped to, and scrolls to it, for convenience
'leaving one record above visible.
57
'Note: The choice of this range name as isect is arbitrary.
'Scroll to active cell, but make sure previous record remain visible:
Application.Goto Reference:=ActiveCell.Offset(RowOffset:=-1), scroll:=True
Cells(ActiveCell.Row + 1, ActiveCell.Column).Activate
'Return to section on skipping and scrolling when no data
End If
Set isect = Nothing
GoTo Line1000
Else
End If
58
'END CODE FOR THESE VARIABLES
'Return to main text.
'-----------------------------------------------------------------------------------------------
'-----------------------------------------------------------------------------------------------
'When entering data on income AMOUNTS (Income1, etc.):
'This particular block is called "IncomeAmounts" [see definitions of names]
'No values prescribed, but must be numeric positive
Else
Cells(Target.Row, Target.Column + 1).Activate
End If
Else
End If
59
'END CODE FOR BLOCK OF VARIABLES
'-----------------------------------------------------------------------------------------------
'-----------------------------------------------------------------------------------------------
'When entering past LOAN HISTORY or ANY CURRENT LOAN information:
Target.Copy 'These two lines replace the formula with its value
Target.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks:=False, Transpose:=False
Application.CutCopyMode = False 'Removes the blinking border from the copied cell
End If
GoTo Line1000
60
Else
End If
'-----------------------------------------------------------------------------------------------
'When entering ORGANIZATION and LOAN SOURCE information:
'[This block, although with two variable groups, is handled with
'If Instr(..).. Or .. , and with one lookup table only:]
GoTo Line1000
Else
End If
61
'END CODE FOR THESE TWO VARIABLE SETS
'Return to main text.
'-----------------------------------------------------------------------------------------------
'-----------------------------------------------------------------------------------------------
'When entering LOAN BALANCE information:
Else
Cells(Target.Row, Target.Column + 1).Activate
End If
GoTo Line1000
Else
End If
'-----------------------------------------------------------------------------------------------
'When entering the INTERVIEWER name:
62
If t = "Interviewer" Then
Else
End If
'-----------------------------------------------------------------------------------------------
'When entering the INTERVIEW DATE:
If t = "DateInterview" Then
63
ElseIf Target.Value < 40179 Then
Target.FormulaR1C1 = "Must be 1st Jan 2010 or later"
Target.Interior.Color = 255
Cells(Target.Row, Target.Column).Activate
Else
'Opens new record with consecutive serial number:
Cells(Target.Row + 1, SnoColHead).Value = Cells(Target.Row, SnoColHead).Value + 1
Cells(Target.Row + 1, SnoColHead + 1).Activate
'Scroll to active cell, but make sure previous record remain visible:
Application.Goto Reference:=ActiveCell.Offset(RowOffset:=-1), scroll:=True
Cells(ActiveCell.Row + 1, ActiveCell.Column).Activate
End If
GoTo Line1000
Else
End If
'-----------------------------------------------------------------------------------------------
'END OF MACRO:
Line1000:
'House-keeping before ending the macro:
Application.DisplayAlerts = False 'To avoid a prompt when deleting the temporary sheet
ActiveWorkbook.Names("TransferCell").Delete
Sheets("TempForTransit").Delete
Application.EnableEvents = True 'Turn events back on
64
On Error GoTo 0 'Allow run time errors again
Application.ScreenUpdating = True
Application.DisplayAlerts = True
End Sub
'-----------------------------------------------------------------------------------------------
65
[This page deliberately left blank]
66
Revisions and known bugs
The FIVDB PPR Unit started to use this template for the computer entry of data
collected during the second round of a household baseline survey in spring 2010.
Increasingly, starting in August, entry has been contracted out to a village-based
organization, the Khagamura ICT, which is using the same template (see page 42).
At present (March 2011), there are no known bugs in the template. Further testing
and adapting to the needs of other surveys will likely reveal hitherto unknown
malfunctions. Regardless, there is ample scope for improvement. In particular, as
noted on page 34, the use of a hidden temporary worksheet used in the transfer of
categorical values strikes us as extremely inelegant. Readers are encouraged to
propose better solutions. New ways to improve the code documentation will also
become manifest once the template is investigated for other survey contexts.
67
Author information
Aldo Benini is an independent researcher and consultant living in Washington DC, USA.
He has known FIVDB since 1980 and occasionally consults on monitoring matters.
Wasima Samad Chowdhury is the Coordinator of Policy Planning and Research (PPR)
Unit. She has been with FIVDB since 1996.
Saiful Hasan is a Research Associate, Abu Saeem Arif and Md. Yasin Mazumder are
Monitoring Associates, in the PPR Unit. They have been with FIVDB since 2009.
Contacts: