Guidelines for Working With Small Numbers
The Assessment Operations Group in the Washington State
Department of Health is coordinating the development of guidelines related to data development
and use in order to promote good professional practice among staff involved in assessment
activities within the Washington State Department of Health and in Local Health Jurisdictions
in Washington. While the guidelines are intended for an audience of differing levels of training
related to data development and use, they assume a basic knowledge of epidemiology and biostatistics.
They are not intended to recreate basic texts and other sources of information related to the topics
covered by the guidelines; rather, they focus on issues commonly encountered in public health
practice and, where applicable, on issues unique to Washington State.
Scope of the "Guidelines for Working with Small Numbers"
Why are small numbers a concern in public health assessment?
What constitutes a breach of confidentiality?
Why do we question the reliability of statistics based on small numbers?
Guidelines for working with small numbers
How to reduce risk of confidentiality breach
How to address statistical issues
Glossary
References and Resources
Scope of the "Guidelines for Working with Small Numbers"
Public health data are typically presented in tabular
form or released as files of record level data. The following guidelines
address data presented in tabular form on paper or available
electronically, produced for or readily accessed by the public, for uses
other than mandated activities of state and local health departments.
These guidelines apply to total population data, not to survey data based
on a proportion of the population. Guidance on disclosing files of
individual records is provided in the DOH policy on release of
confidential data/information (Policy 17.006 Release of confidential
data/information).
Why are small numbers a concern in public health assessment?
Public health policy decisions are fuelled by information.
Often, this information is in the form of statistical data. Questions concerning health outcomes
and related health behaviors and environmental factors often are studied within small subgroups of
a population. Continuing improvements in the performance and availability of computing resources,
including geographic information systems, and the need to better understand the relationships
between environment, behavior, and consequent health effects have led to increased demand for data
on small populations. These demands are often at odds with the need to preserve privacy and data
confidentiality. Small numbers also raise statistical issues concerning the accuracy, and thus
usefulness, of the data.
In general, problems with confidentiality arise when
there are small denominators (population size represented in a specific cell in a table), and
problems with data reliability arise when there are small numerators (cases in a specific cell
in a table).
What constitutes a breach of confidentiality?
A breach of confidentiality occurs when analysts
release information in a way that identifies an individual and reveals
confidential information about that person. The following guidelines
provide some cues to situations that present high risk for a breach of
confidentiality and some suggestions on how to reduce this risk. In
addition to these guidelines, analysts should become familiar with the
provisions of DOH Policy 17.006 and RCW 70.02, as well as confidentiality
requirements for specific databases as defined in state laws and
regulations. Please note that
state laws and regulations supersede guidance provided
in this document.
Why do we question the reliability of statistics based on small numbers?
Estimates based on a random sample of a
population are subject to error due to sampling variability. Rates and
percentages based on a full population count are also subject to random
variation (see other relevant Guidelines, listed at
the end of this document). The random variation may be substantial when the
measure, such as a rate or percentage, has a small number of events in the numerator.
Typically, rates based on large numbers provide stable estimates of the
true, underlying rate. Conversely, rates based on small numbers may
fluctuate dramatically from year to year, or differ considerably from one
small place to another small place, even when there is no meaningful
difference. Meaningful analysis of differences in rates between geographic
areas or over time requires that the random variation in rates be
quantified; this is especially important when rates or percentages have
small numerators.
Guidelines for working with small numbers
These guidelines address both confidentiality and
statistical issues in working with small numbers. In some DOH data
systems, such as the AIDS registry, the entire database is considered
confidential. In other systems, such as the birth certificate system, many
but not all data items are confidential. In yet other systems, such as the
death record system, none of the items are confidential (with the exceptions listed in endnote 1).
A first step in using these guidelines is to determine if the data set(s)
you are working with contain confidential information. If so, the following section on protecting
confidentiality needs to be carefully reviewed. Otherwise, you need only concern yourself with the
statistical issues section.
In general, problems with confidentiality arise when there
are small denominators (population size represented in a specific cell in a table), and problems with
data reliability arise when there are small numerators (cases in a specific cell in a table). In larger
populations, it is more difficult to identify individuals from data released in tables. For example,
if there are 5,000 individuals in a specific age-race-sex group in a single county, the likelihood of
identifying a single individual from data in a published table is quite small. In smaller populations,
it is more likely that an individual might be identifiable. Further, even in larger populations, it is
conceivable that a single individual might be identifiable, if there are only one or two individuals with
some special characteristic. For example, in a modest-sized community, it may be common knowledge that
there is only one child who is frequently hospitalized, and a table showing that this community has one
case of pediatric HIV-AIDS could unintentionally disclose confidential information. Thus, it is desirable
to have rules for privacy protection which consider both denominator size and numerator size. Rules to
address statistical reliability can be limited to consideration of numerator size.
How to reduce risk of confidentiality breach
Examine denominator size for each cell. Prior to
disseminating tables that contain confidential information,
analysts should first consider the size of the denominators: the population size represented in each cell in the
table. Generally, tabular data based on denominators greater than 300
persons per cell present minimal risk for individual
identification. The risk of violating confidentiality increases substantially
when data are tabulated for small subgroups of the population within small
geographic areas. Caution should be exercised by the analyst if the population
size is between 100 and 300, and extreme caution is warranted when the population
is less than 100.
Examine numerator size for each cell. Second, data
analysts should consider the number of events in each cell of a table to be
released (i.e., the numerator for a rate calculation). If the count of cases or
events in a cell is less than three, the data analyst needs to consider whether
a breach of confidentiality is likely. A count of no events in the cell is clearly
no threat to confidentiality, but a count of one or two events may be.
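The two checks above can be sketched as a small screening routine. This is only an illustration: the function names are hypothetical, the example cells are invented, and treating a population of exactly 100 or exactly 300 as "caution" is an assumption, since the text does not say how boundary values should be handled.

```python
def denominator_risk(population):
    """Classify a cell's population size using the thresholds quoted in the text."""
    if population < 100:
        return "extreme caution"
    if population <= 300:  # assumption: 100 and 300 themselves fall in this band
        return "caution"
    return "minimal risk"


def numerator_needs_review(count):
    """A count of one or two may identify someone; a count of zero does not."""
    return 0 < count < 3


def screen_cell(count, population):
    """Return both flags for one table cell."""
    return {
        "denominator": denominator_risk(population),
        "numerator_review": numerator_needs_review(count),
    }


# Hypothetical cells: (events, population) keyed by an invented label.
cells = {"area A, ages 0-4": (2, 85), "area B, ages 0-4": (14, 1250)}
for label, (count, population) in cells.items():
    print(label, screen_cell(count, population))
```

A flag from either check is a prompt for the analyst's judgment, not an automatic decision to withhold the cell.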
The general approach to privacy
protection involves what has been termed "computational disclosure
control," which includes both aggregation of data values in the dataset
before analysis, and cell suppression in a table after analysis (Sweeney
1997).
Aggregation. Aggregation of data values is appropriate for
fields with large numbers of values, such as dates, diagnoses, and geographic areas; it is the primary
method used to collapse a dataset in order to create tables with no small numbers as denominators or
numerators in cells. The following table shows examples.
| Field | Type | Small granularity | Medium granularity | Large granularity |
|---|---|---|---|---|
| Age | Continuous | Year of birth | 5-year age group | 10-year age group |
| Date of occurrence | Continuous | Day | Month | Year |
| Diagnosis | Nominal | Complete ICD code | Three-digit ICD | "Selected cause" tabulation |
| Geography | Ordinal (spatial) | Zip code, census tract | Sub-county area | County |
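As a rough sketch of how values might be collapsed from the small to the medium granularity shown in the table, the helpers below group ages into fixed-width bands, reduce dates to months, and truncate a diagnosis code to its first three characters. The helper names, the record layout, and the sample records are all hypothetical, and the truncation rule assumes codes written with a decimal point.

```python
from collections import Counter
from datetime import date


def age_group(age, width=5):
    """Map an age in years to a fixed-width band such as '30-34'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def month_of(d):
    """Reduce a full date to year and month."""
    return f"{d.year}-{d.month:02d}"


def icd_three_digit(code):
    """Keep the part of the code before the decimal point (at most 3 characters)."""
    return code.split(".")[0][:3]


# Hypothetical records: (age, date of occurrence, diagnosis code).
records = [
    (31, date(1999, 3, 4), "493.90"),
    (33, date(1999, 3, 19), "493.00"),
    (47, date(1999, 11, 2), "250.01"),
]

# Count records per aggregated cell.
cells = Counter(
    (age_group(age), month_of(when), icd_three_digit(code))
    for age, when, code in records
)
for cell, count in sorted(cells.items()):
    print(cell, count)
```

Calling `age_group(age, width=10)` gives the large-granularity bands instead; the same idea extends to reducing dates to years.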
Cell Suppression. When it is not
possible, or desirable, to create a table with no small numbers as denominators or numerators in cells,
cell suppression is used, together with complementary suppression. "Primary" cell suppression is
used to withhold the data (numerator) in the cell which fails to meet the threshold, followed by
suppression of three other cells in order to avoid inadvertent disclosure through back-calculation
from the row and column totals. Note that cell suppression is a method of last resort, both because
complementary suppression often unavoidably withholds data values that could otherwise be released
and because of the amount of manual labor necessary to implement the method. The following table
shows an example in which the cell in the upper left (0-34 Black) did not meet the threshold for
release, whether because of numerator or denominator size. Suppressed cells are marked "S".
| Age | Black | White | Other | Total |
|---|---|---|---|---|
| 0-34 | S | 30 | S | 60 |
| 35-64 | S | 60 | S | 150 |
| 65+ | 70 | 90 | 80 | 240 |
| Total | 120 | 180 | 150 | 450 |
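One way to check a suppression pattern before release is to try the back-calculation yourself. The sketch below repeatedly fills in any suppressed cell that is the only unknown in its row or column, using the published totals. The function name is hypothetical, and the values used in the "primary suppression only" case are invented figures chosen to be consistent with the totals in the table, because the true suppressed values are not given.

```python
def back_calculate(rows, row_totals, col_totals):
    """Return {(row, col): value} for suppressed cells (None) recoverable by subtraction."""
    table = [list(r) for r in rows]  # work on a copy
    recovered = {}
    changed = True
    while changed:
        changed = False
        # A row with exactly one unknown gives that cell away.
        for i, row in enumerate(table):
            unknown = [j for j, v in enumerate(row) if v is None]
            if len(unknown) == 1:
                j = unknown[0]
                table[i][j] = row_totals[i] - sum(v for v in row if v is not None)
                recovered[(i, j)] = table[i][j]
                changed = True
        # The same logic applied to columns.
        for j in range(len(col_totals)):
            col = [table[i][j] for i in range(len(table))]
            unknown = [i for i, v in enumerate(col) if v is None]
            if len(unknown) == 1:
                i = unknown[0]
                table[i][j] = col_totals[j] - sum(v for v in col if v is not None)
                recovered[(i, j)] = table[i][j]
                changed = True
    return recovered


row_totals = [60, 150, 240]
col_totals = [120, 180, 150]

# The table as published above: four suppressed cells.
published = [[None, 30, None], [None, 60, None], [70, 90, 80]]
print(back_calculate(published, row_totals, col_totals))

# Primary suppression only, with invented values in the other three cells.
primary_only = [[None, 30, 20], [40, 60, 50], [70, 90, 80]]
print(back_calculate(primary_only, row_totals, col_totals))
```

With the four-cell pattern nothing can be recovered by subtraction, while suppressing only the upper-left cell lets it be rebuilt from either its row or its column. Passing this check is necessary but not sufficient: the totals may still confine a suppressed value to a narrow range.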
Other Methods. When neither
of these methods (aggregation of data values to increase granularity, and cell suppression) is
satisfactory, two alternatives remain. The first, and better choice, is to combine multiple years
of data (which is a form of aggregation). The effect will be to increase the effective population
size, since the (usually unstated) denominator is actually "person-time" in rate calculations, and
the numerators are likely to rise correspondingly as well. The second alternative is to omit certain
fields from analysis entirely. A recent example involved the release of asthma data: it was not
possible to achieve adequately large cell denominators in annual county-level data showing both
age-specific and gender-specific counts and rates. An advisory group opted to omit the gender-specific
data, and display only tables of age-specific data, on the grounds that no intervention programs
targeted groups differently on the basis of gender, but most intervention programs target age
groups differently.
Group identification. In addition
to individual identification, analysts need to be alert to risks for group identification. Here,
something confidential is revealed about a group of individuals identifiable by their age, race,
or other reported characteristics. While this type of disclosure has received less attention than
individual disclosure, it represents an emerging concern and should be considered when deciding
whether to publish data. Note that this is more of a problem when the prevalence is high (over 80%)
than it is when the prevalence in the group is low.
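A minimal sketch of a group-disclosure check follows. The 80% figure comes from the text; using it as a strict cut-off, and the function name itself, are assumptions made for illustration.

```python
def group_disclosure_risk(cases, population, threshold=0.80):
    """Flag a group when the share with the confidential characteristic exceeds the threshold."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population > threshold


# Hypothetical group: 45 of 50 members have the characteristic (90%).
print(group_disclosure_risk(45, 50))
```

A flagged group means that publishing the figure effectively reveals the characteristic for almost every member, even though no single person is named.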
In summary: The following practices can help assess
and reduce confidentiality risks:
- Be cautious when reporting
rates or ratios based on denominators less than 300 and extremely
cautious when denominators are less than 100.
- Be cautious when reporting counts less than 3.
- Be cautious when reporting a specific (confidential) characteristic
of a population if a very high proportion of the population has this
characteristic.
- When producing multiple tables, be careful that users cannot derive
confidential information through a process of subtraction.
- When producing multiple tables, consider prior preparation of a
"public use dataset" which has been created such that it is
not possible to create any tables which have any cells sized less than
100 in the denominator (population size). This strategy is particularly
wise when data releases may be done from a dataset by multiple individuals
over an extended period of time, as it eliminates the tedious task of
comparing each release to every prior release.
How to address statistical issues
Increase numerator size. In preparing a
data table for dissemination, it is recommended that analysts first
examine the counts in each cell of the table. If rates are desired and the
numerator of any cell is less than 20, an effort should be made to
increase the size of the numerator. (Use of 20 events as the threshold for
reliability is consistent with standard CDC practice.) Techniques to
accomplish this include the following:
- Combine multiple years of data,
- Collapse data categories, and/or
- Expand the geographic area under consideration.
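The first technique, combining years, can be sketched as follows: starting from the most recent year, add earlier years until the pooled numerator reaches 20, summing the yearly populations as person-years. The function name and the yearly figures are hypothetical.

```python
def pool_recent_years(yearly, min_events=20):
    """Pool (year, events, population) tuples, newest first, until events >= min_events.

    `yearly` must be ordered oldest to newest. If the threshold is never
    reached, all years are pooled and the caller should check the count.
    """
    years = []
    events = 0
    person_years = 0
    for year, count, population in reversed(yearly):
        years.append(year)
        events += count
        person_years += population
        if events >= min_events:
            break
    return sorted(years), events, person_years


# Invented counts and populations for a small area.
yearly = [(1996, 6, 2100), (1997, 9, 2050), (1998, 4, 1980), (1999, 8, 2000)]
years, events, person_years = pool_recent_years(yearly)
rate_per_1000 = events / person_years * 1000
print(years, events, person_years, round(rate_per_1000, 1))
```

The pooled rate describes the whole multi-year period, so any table using it should label the years covered.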
Include confidence intervals. The inclusion of confidence intervals for rates
is strongly recommended regardless of the number of health events, but it
is especially important when the count is less than 20 (see forthcoming
Confidence Interval guideline). Generally, rates with fewer than
20 events in the numerator have very wide confidence intervals.
For example, an infant death rate of 10 per 1,000, based on 20 deaths out of a population of 2,000 live
births, has a Poisson-based 95% confidence interval between 6 and 15. Clearly, this is not very precise
information and users of the data need to know this.
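The interval in this example can be reproduced, approximately, with exact Poisson limits on the count. The text says only that the interval is "Poisson-based", so the choice of the exact method here is an assumption, as are the function names. The sketch uses only the standard library and finds each limit by bisection.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return min(total, 1.0)


def exact_poisson_ci(count, alpha=0.05):
    """Exact two-sided confidence limits for a Poisson count."""
    # Lower limit: the mean at which P(X >= count) equals alpha / 2.
    if count == 0:
        lower = 0.0
    else:
        lo, hi = 0.0, float(count)
        for _ in range(100):
            mid = (lo + hi) / 2
            if 1.0 - poisson_cdf(count - 1, mid) < alpha / 2:
                lo = mid
            else:
                hi = mid
        lower = (lo + hi) / 2
    # Upper limit: the mean at which P(X <= count) equals alpha / 2.
    lo, hi = float(count), count + 10 * math.sqrt(count + 1) + 10
    for _ in range(100):
        mid = (lo + hi) / 2
        if poisson_cdf(count, mid) > alpha / 2:
            lo = mid
        else:
            hi = mid
    upper = (lo + hi) / 2
    return lower, upper


def rate_ci(count, population, per=1000):
    """Scale the limits on the count to limits on the rate."""
    lower, upper = exact_poisson_ci(count)
    return lower / population * per, upper / population * per


low, high = rate_ci(20, 2000)
print(f"{low:.1f} to {high:.1f} per 1,000")
```

For 20 deaths among 2,000 live births this gives limits of roughly 6.1 and 15.4 per 1,000, in line with the interval of 6 to 15 quoted above. The term-by-term sum is adequate for the small counts discussed here but is not intended for very large counts.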
In instances where it is not feasible to incorporate confidence
intervals into a data table (which may be the case with many of DOH's
routinely produced, large data tables), it is recommended that analysts:
- Always report the numerator on which the rate is based and
- Include a footnote indicating that rates based on fewer
than 20 events are likely to be unstable and imprecise.
Suppress rates. Suppress rates based on very
small numbers (i.e., fewer than 5 health events), reporting only the count
(numerator); see endnote 2.
When rates are suppressed, tables should be constructed such that an indicator (e.g., asterisk) appears
in the cell and a legend under the table explains the reason for suppression.
Suppress confidence intervals. When
rates based on very small numbers (i.e., fewer than 5 health events) are suppressed, confidence intervals
should also be suppressed. When confidence intervals are suppressed, tables should be constructed such
that an indicator (e.g., asterisk) appears in the cell and a legend under the table explains the reason
for suppression.
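The presentation rules in this section can be sketched as a single cell formatter: counts are always shown, rates and confidence intervals are replaced by an asterisk below 5 events, and rates based on fewer than 20 events carry an instability note. The function name, the dictionary layout, and the wording of the notes are hypothetical.

```python
def format_rate_cell(count, population, ci=None, per=1000):
    """Build the display values for one table cell.

    `ci` is an optional (lower, upper) pair already expressed per `per` persons.
    """
    if count < 5:
        # Report only the numerator; the legend explains the asterisk.
        return {
            "count": count,
            "rate": "*",
            "ci": "*",
            "note": "* suppressed: fewer than 5 events",
        }
    rate = count / population * per
    ci_text = f"{ci[0]:.1f}-{ci[1]:.1f}" if ci is not None else ""
    note = "unstable: fewer than 20 events" if count < 20 else ""
    return {"count": count, "rate": f"{rate:.1f}", "ci": ci_text, "note": note}


# Hypothetical cells with a population of 2,000 each.
for count in (3, 12, 20):
    print(format_rate_cell(count, 2000))
```

Keeping the count visible in every case follows the recommendation to always report the numerator on which a rate is based.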
Glossary
Confidential data/information: Information
that an individual or establishment has provided in a relationship of trust, with the expectation that it
will not be divulged in an identifiable form. The confidentiality of specific data elements or information
in individual databases or record systems is defined by state laws and regulations and/or policies and
procedures developed for those systems.
Confidentiality breach: an
unauthorized release of identifiable or confidential data/information, which may result from a
security failure, intentional inappropriate behavior, human error, or natural disaster. A breach
of confidentiality may or may not result in harm to one or more individuals.
Individually identifiable data/information: Data/information
that identifies, or is reasonably likely to be used to identify, an individual
or an establishment protected under confidentiality laws. Identifiable data/information may include, but
is not limited to, name, address, telephone number, social security number, and medical record number.
Data elements used to identify an individual or protected establishment can vary depending on the geographic
location and other variables (e.g., rarity of person's health condition or patient demographics). For
purposes of this guideline, "identifiable information" includes potentially identifiable information.
Number of events: The number of persons
or events represented in any given cell of tabulated data (e.g., numerator). (see Guidelines for
Using and Developing Rates for Public Health Assessment, available on the Internet at
http://www.doh.wa.gov/Data/guidelines/Rateguide.htm)
Population size: The total number of
persons or events included in the calculation of an event rate (e.g., denominator). (see Guidelines for
Selection of Population Denominators, available on the Internet at
http://www.doh.wa.gov/Data/guidelines/Popguide.htm)
Public use dataset: A "de-identified"
dataset with all individually identifiable data/information removed, and with remaining data fields
modified (through cell suppression, aggregation of data values, or field omission) such that it is
not possible to create any tables which have any cells sized less than 100 in the denominator. In a
PUDS, there will be no unique records, because at least three individuals (or events) are required to
have identical characteristics, owing to the numerator rule for confidentiality.
Rate: A measure of the frequency
of an event per population unit. (see Guidelines for Using and Developing Rates for Public Health
Assessment, available on the Internet at
http://www.doh.wa.gov/Data/guidelines/Rateguide.htm). In these guidelines the terms rate, proportion, percentage, and ratio are
interchangeable.
Sensitive personal information: Whereas
confidential personal information means information collected about a person that is readily identifiable
to that specific individual, sensitive personal information extends beyond that to information which may
be inferred about individuals, where that information is associated with some stigma. Examples are
certain diseases, health conditions, or health practices. The sensitivity of certain personal information
may vary between communities.
References and Resources
Cox LH. Protecting Confidentiality in Small Population Health and Environmental
Statistics. Statistics in Medicine, Vol. 15, 1996: 1895-1905.
Dever GA. Outcome assessment: Small area analysis and quality improvement methods.
In: Improving outcomes in public health practice. Gaithersburg MD: Aspen,
1997: 341-77.
Gold EB. Confidentiality and privacy protection in epidemiologic research. In:
Coughlin SS, Beauchamp TL [eds]. Ethics and Epidemiology. New York: Oxford,
1996: 128-41.
Kleinman JC. Assessing stability of rates and changes in rates. In: Statistical
Notes for Health Planners. National Center for Health Statistics, Number 2,
July, 1976: 9-12.
NCHS Staff Manual on Confidentiality. Dept. of Health and Human
Services, Public Health Service, National Center for Health Statistics. Hyattsville,
MD. September, 1984.
Sweeney L. Weaving technology and policy together to maintain confidentiality.
Journal of Law, Medicine & Ethics 1997;25:98-110.
Relevant Washington state law and regulation:
Public Disclosure law (Chapter 42.17 RCW)
Executive Order on Public
Records Privacy Protections (EO 00-03).
Vital records rules: Requesting a listing or file of vital records with personal
identifiers (WAC 246-490-030),
Requesting vital records information without personal identifiers
(WAC 246-490-020).
The following examples, provided by the data custodians at DOH, include the
major datasets used for assessment in the state.
Birth records
RCW 70.58.055
and
WAC 246-491-039.
Death records
RCW 9.02.100
and
WAC 246-490-110
(for deaths related to abortion),
WAC 246-491-039
(for fetal death records), and
RCW 70.24.105
(for deaths related to HIV-AIDS).
HIV/AIDS and other communicable disease data
RCW 70.24.105,
and the new WAC 246-101 [adopted 7/12/2000 by the state Board of Health, effective
11/30/2000].
Hospital discharge data
RCW 43.70.052
and
WAC 246-455-080.
Cancer registry data
RCW 70.54.250
and
WAC
246-102-070.
Relevant DOH Policies:
Public Disclosure (17.003 dated June 5, 1996).
Employee Responsibilities with Confidential Information (17.005 dated June 1,
1999).
Release of confidential data/information (Policy 17.006 dated August 10, 2000).
See also Attachment A "Decision tree", Attachment B "Data sharing
agreement template", and Attachment C "Instructions for completing
the DOH data sharing agreement".
Issuance of Confidential Death Records/Information (Policy CHS-D2, dated
November 1, 1994)
Website Confidentiality and Public Disclosure.
Note: DOH staff can obtain copies of these policies on the DOH intranet.
Others can obtain copies from the DOH Confidentiality/Privacy Coordinator.
Relevant Guidelines:
Data Analysis
Guidelines.
Guidelines for
Selection of Population Denominators.
Guidelines for
Using and Developing Rates for Public Health Assessment.
Confidence
Intervals Guidelines.
Endnotes:
1 Except for (1) death records for
children under the age of 12 who die as the result of AIDS transmitted at birth,
(2) fetal death records (where, like birth certificates, portions are confidential),
and (3) death records for infants who are born alive following an induced
termination of pregnancy (in which case the mother's identity is protected).
2 Geographic modeling, including the use of
Bayesian "smoothing," has been used as an alternative to suppression of
rates. A discussion of this method is beyond the scope of these guidelines.