Guidelines for Address Matching and Geocoding
Revision Date: December 19, 2008
Background
Address Standardization
Address Matching
Geocoding
The Importance of Geocoding
Geocoding Software
Street Centerline Data
Accuracy
Using Geocoded Data
Current geocoding process at the DOH
Appendix A - Definition of Terms
Appendix B - Address Standardization
Appendix C - Local Government Data Used
Appendix D - Address Matching Accuracy
Appendix E - File Structures
Guidelines for Address Matching and Geocoding (Word Document)
Background
The intent of this guideline is to be a technical
reference for geocoding street addresses, and to provide
some background on the technologies used by the
Washington State Department of Health (DOH). The DOH
Division of Information Resource Management (DIRM)
currently provides address standardization and geocoding
services. These services are available to all divisions
of DOH as well as other State agencies, Local Health
Jurisdictions and other health related agencies. DIRM
attempts to provide the highest quality and number of
street level address matches. To this end, DOH has
entered into data sharing agreements with many
Washington State counties to share accurate street and
parcel ownership data. Combining these data with
commercially available data allows DIRM to maximize the
number and quality of matches. Appendix A provides
definitions of technical terms used in this guideline.
Address Standardization
Address standardization takes a street address and
ZIP code and attempts to correct misspellings and
changes in ZIP codes. The address is parsed into
standard pieces including the house number, street name,
direction prefix, direction suffix, and the street type.
Once these parts of the address are created, the values
are standardized (e.g. AV becomes AVE, LP becomes
LOOP). DIRM currently uses the Centrus software from Group
1 Software, Inc. (http://www.centrus.com). The data used
by Centrus are proprietary and come from both the U.S.
Postal Service (USPS) and Geographic Data Technologies.
The data are updated quarterly, and the standardized
addresses are CASS certified by the USPS for bulk
mailing rates. The Centrus software compares addresses
to a USPS national database. This step is critical for
increasing geocoding match rates. (See Appendix B for
examples.)
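The parsing and standardizing steps described above can be illustrated with a minimal Python sketch. This is not the Centrus implementation: the function name `parse_address` and the short lookup tables are hypothetical, and CASS-certified software relies on the full USPS reference data rather than a small dictionary.

```python
import re

# Hypothetical lookup tables for illustration only.
STREET_TYPES = {
    "AV": "AVE", "AVE": "AVE", "AVENUE": "AVE",
    "LP": "LOOP", "LOOP": "LOOP",
    "ST": "ST", "STREET": "ST",
    "RD": "RD", "ROAD": "RD",
    "DR": "DR", "DRIVE": "DR",
}
DIRECTIONS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}


def parse_address(address):
    """Split a street address into standard pieces and standardize the type."""
    # Drop periods and commas, then work with upper-case tokens.
    tokens = re.sub(r"[.,]", " ", address.upper()).split()
    parts = {"house_number": "", "predir": "", "street_name": "",
             "street_type": "", "postdir": ""}
    if tokens and tokens[0].isdigit():
        parts["house_number"] = tokens.pop(0)
    # Always leave at least one token for the street name.
    if len(tokens) > 1 and tokens[0] in DIRECTIONS:
        parts["predir"] = tokens.pop(0)
    if len(tokens) > 1 and tokens[-1] in DIRECTIONS:
        parts["postdir"] = tokens.pop()
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        parts["street_type"] = STREET_TYPES[tokens.pop()]
    parts["street_name"] = " ".join(tokens)
    return parts
```

For example, `parse_address("1060 S Main Av")` yields house number 1060, prefix direction S, street name MAIN, and street type AVE. Correcting misspellings and ZIP code changes requires a reference database and is outside the scope of this sketch.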
Address Matching
Address matching is the process of matching the
street address and ZIP code in the original dataset to
another address and ZIP code. Typically, the second
address and ZIP code represent street centerlines or
ownership parcels. The street centerlines can have
address ranges and ZIP codes assigned to each side of
the street. The ownership parcels have a single address
and ZIP code assigned to a point.
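As a rough illustration of matching against a street centerline, the following Python sketch checks a house number, street name, and ZIP code against a segment that carries an address range and ZIP code for each side. The `Segment` class and `match_side` function are hypothetical names. The odd/even parity test is an assumption about how sides are usually told apart, not a rule stated in this guideline.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One street centerline segment with an address range per side."""
    name: str
    left_from: int
    left_to: int
    left_zip: str
    right_from: int
    right_to: int
    right_zip: str


def match_side(segment, house_number, street_name, zip_code):
    """Return 'left' or 'right' if the address falls on the segment, else None."""
    if street_name.strip().upper() != segment.name.upper():
        return None
    sides = (
        ("left", segment.left_from, segment.left_to, segment.left_zip),
        ("right", segment.right_from, segment.right_to, segment.right_zip),
    )
    for side, start, end, side_zip in sides:
        low, high = sorted((start, end))
        same_parity = house_number % 2 == start % 2  # assumed odd/even convention
        if side_zip == zip_code and low <= house_number <= high and same_parity:
            return side
    return None
```

A parcel match is simpler, because each parcel carries a single address and ZIP code assigned to a point, so an exact comparison is enough.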
Geocoding
There are three main types of geocoding functions.
The first type assigns latitude and longitude to a
street address that has been matched to a street
centerline or ownership parcel. This is the only type of geocoding covered in this guideline. These addresses can
then be displayed as points on a map, or aggregated to
larger areas (e.g. city limits, wellhead protection
areas, school districts). For example, this type of geocoding can be used to show points on a map for all
the addresses in the Washington State Cancer Registry.
CAUTION: In general, the latitude and longitude at
which a health event occurred are confidential
information.
Just as publishing someone's address is most often a
violation of confidentiality, data users need to be
sensitive to the scale at which they display dots
representing health events on maps. Before disseminating
such maps they need to be sure that this method of
visualizing data does not violate confidentiality.
The second type of geocoding is used for data without
a street address. If the data in the original dataset
has a geographic reference (e.g. ZIP code, county, U.S.
Census tract) it can be geocoded to those geographic
features. The data can be displayed as counts in
graduated colors on a map. For example, survey results
that contain only ZIP codes can be shown on a map, by
the number of results in each ZIP code.
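A minimal Python sketch of this second type of geocoding is shown below. It only produces the counts per ZIP code; joining the counts to ZIP code polygons and shading them is left to the GIS software. The function name `count_by_zip` and the record layout are assumptions made for illustration.

```python
from collections import Counter


def count_by_zip(records):
    """Count records per 5-digit ZIP code, skipping records with no ZIP."""
    return Counter(r["zip"][:5] for r in records if r.get("zip"))
```

The resulting counts can then be linked to a ZIP code layer using the ZIP code as the key.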
The third type of geocoding is used for data without
a street address or a specific geographic reference.
This requires a common link between the data in a given
data set and an existing geographic feature. For
example, a data set that contains a hospital name and
bed capacity can be shown at the hospital locations on a
map. This is accomplished by linking the hospital name
with previously geocoded hospitals that also contain the
name.
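The name-based link described above could be sketched in Python as follows. The function `link_by_name`, the field names, and the case-insensitive comparison are assumptions made for illustration; real data usually need more careful name cleaning before they will link.

```python
def link_by_name(records, geocoded, key="name"):
    """Attach coordinates to records by matching on a shared name field."""
    # Build a lookup from the previously geocoded features.
    lookup = {g[key].strip().upper(): (g["x_coord"], g["y_coord"])
              for g in geocoded}
    linked, unlinked = [], []
    for record in records:
        coords = lookup.get(record[key].strip().upper())
        if coords is None:
            unlinked.append(record)
        else:
            linked.append({**record, "x_coord": coords[0], "y_coord": coords[1]})
    return linked, unlinked
```

Records that do not link are returned separately so they can be reviewed by hand.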
The Importance of Geocoding
The majority of data that DOH uses has an address or
other geographic reference. It has also been estimated
that over 90% of corporate America's data has some kind
of geographic reference. Geocoding allows DOH to use
these data to display health-related information on maps
and to conduct geospatial analyses to determine whether
there are geographic patterns in rates of health-related
events. For example:
- determining rates of health-related events by
county may require geocoding;
- investigating disease outbreaks and potential
clusters requires accurately geocoded locations; and
- geocoding to larger areas, like ZIP codes, allows
sensitive data to be displayed while maintaining
confidentiality.
A partial listing of geocoded data used at the DOH
includes vital records, cancer registry data, daycare
facilities, cases of sexually transmitted disease,
tobacco retailers, schools, hospitals, pharmacies,
hazardous waste sites, and drug labs.
Geocoding Software
DIRM staff evaluated five software packages for
accuracy and overall match rates: ArcView, Centrus,
MapMarker, GeoVista, and Maptitude. After detailed
benchmarking in 2000, it was still unclear which
software performed the best. While some software, such
as Centrus, provided address standardization that
improved the match rates, the quality of the underlying
data seemed to make the most difference. DIRM decided to
use the combination of Centrus and ArcView GIS.
Street Centerline Data
DIRM staff evaluated a variety of street centerline
data: U.S. Census TIGER (1992-2000), ESRI Streetmap, GDT
Dynamap 2000, and Navigation Technologies. These data
sets were provided by the U.S. Census or purchased from
commercial vendors. While no data set was complete,
Navigation Technologies was the most accurate and
complete for the entire state. Local level street data
were also evaluated. The overall accuracy of street data
obtained from counties and cities is higher than the
other statewide data sets. Appendix C shows the counties
in Washington for which we have digital data available
for streets or parcel centroids. Since no data set was
complete, DIRM recommends using more than one source.
Accuracy
The process of address matching and geocoding
involves many variables that affect the accuracy of the
results. Below is a partial list of some of the
potential inaccuracies.
- The input address or ZIP code is incorrect.
- The address standardization software incorrectly
parses the address or ZIP code.
- The street centerline attribute data may be
incorrect for the address range, street name or ZIP
code.
- A street may be "flipped" so the address is placed
on the wrong side or at the opposite end of the
street. This can place a geocode in the adjacent U.S.
Census tract or even county.
- The various street and parcel data files do not
exactly overlay with U.S. Census tracts. The
boundaries of the tracts are based on TIGER streets.
Latitude and longitude may be more positionally
accurate than the TIGER data resulting in tract
assignments that are incorrect.
Each successful geocode generates a match score
(called "Av_score") that reflects the accuracy of the
match. Scores for street or parcel matches range from 70
to 100. A score of 100 indicates that after the geocoding
software parsed the address, a street or parcel was found
where everything matched. A score of 0 indicates a
centroid match or an unmatched address. Appendix D
contains some examples of address matches and the
assigned scores.
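The score ranges above (together with the Av_score description in Appendix E) can be written as a small Python helper. The function name `classify_match` is hypothetical, and treating any other value as an error is an assumption of this sketch.

```python
def classify_match(av_score):
    """Translate an Av_score value into the match type it represents."""
    if av_score == 100:
        return "Close"
    if 70 <= av_score <= 99:
        return "Approximate"
    if av_score == 0:
        return "Centroid or unmatched"
    # Scores between 1 and 69 are not described in this guideline.
    raise ValueError(f"unexpected Av_score: {av_score}")
```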
CAUTION: Rates that are based on geocoded data can
change significantly over time. For example, rates based
on data geocoded in 2000 could differ from rates if the
same data were geocoded in 2003. As data and technology
improve, both the number and accuracy of matched records
are expected to increase, and this might affect rate
calculations. Thus, it is important to assess the
proportion of geocoded records and the accuracy of the
matches when interpreting rates or other statistics
based on geocoded data. (See Using Geocoded Data.)
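One simple way to assess the proportion of geocoded records is to summarize the Av_score values before calculating rates. The Python sketch below is only illustrative: the function name `match_summary` is hypothetical, and the cutoff of 70 for street-level matches follows the score ranges given in this guideline.

```python
def match_summary(av_scores):
    """Return the share of street-level matches and of all other records."""
    total = len(av_scores)
    if total == 0:
        return {"street_level": 0.0, "centroid_or_unmatched": 0.0}
    street_level = sum(1 for score in av_scores if score >= 70)
    return {
        "street_level": street_level / total,
        "centroid_or_unmatched": (total - street_level) / total,
    }
```

Comparing these proportions for the same data set geocoded in different years gives a first indication of whether a change in rates could be an artifact of improved matching.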
Using Geocoded Data
In order to use the geocoded data, especially at
relatively small geographies such as sub-county areas,
there must be a way to evaluate the accuracy of each
geocode. At a minimum the "Av_score" field in the output
file should always accompany the output data. For
example, when there are no street centerline or parcel
matches for the street addresses in the Washington State
Cancer Registry, DIRM uses the 5-digit ZIP code or city
name to assign addresses to the centroid of a ZIP code,
city, or populated place. This process maximizes the
number of records that can be assigned to a county and
is useful for county level rates and reports. The user
can use the "Av_score" to know which records were
geocoded using centroids and which were matched at the
street level. These centroid geocodes may not be
appropriate for small area analysis like cluster
investigations or census tract level analysis. (See
Processing Unmatched Addresses.)
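A data user might apply this rule with a filter such as the Python sketch below. The function name, the set of level names, and the record layout are all hypothetical. The sketch assumes every record already carries coordinates, whether from a street-level match or from a centroid.

```python
# Hypothetical labels for analyses that need street-level accuracy.
SMALL_AREA_LEVELS = {"tract", "block_group", "cluster"}


def records_for_analysis(records, level):
    """Keep only street-level matches when the analysis is small-area."""
    if level in SMALL_AREA_LEVELS:
        return [r for r in records if r["Av_score"] >= 70]
    # County-level work can also use centroid geocodes (Av_score of 0).
    return list(records)
```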
Current geocoding process at the DOH
DIRM uses the Centrus software to perform address
standardization and ArcView software to perform the
geocoding and the assignment of spatial attributes. This
process is automated using the Avenue scripting language
inside ArcView. This allows the use of multiple street
and parcel datasets. The accuracy and source of the
geocodes are also tracked. See Figure 1 for an overview
of this process.
Address Standardization
1. Address data are provided to DIRM in a digital
format (e.g. Access, ASCII, dBase).
2. The addresses are standardized using the Centrus
software to fix misspellings and ZIP code errors.
(See Appendix B.) Centrus also attempts to geocode the
addresses; these geocodes are used as approximate
matches (Step 8 below).
Address Matching
3. Inside ArcView, the tolerances are set to accept
only close matches.
4. The original addresses are matched to street
centerlines using the following data sets, in order.
Once a match is made, the address is not used for the
next data set.
- Local Government streets or parcel databases. See
Appendix C.
- NAVTEQ GPS Streets, Navigation Technologies
- Streetmap 1000, Environmental Systems Research
Institute (ESRI)
5. For records that are not matched in Step 4, Step 4
is repeated using the standardized addresses.
Tolerances continue to be set to close. This is done
after Step 4 because we first want to use the
original address exactly as it was entered.
6. Inside ArcView, the matching tolerances are set to
accept "approximate" matches only.
7. Steps 4 and 5 are repeated for records not matched
in Steps 3-5.
8. If Centrus geocoded any addresses that ArcView did
not, they are included as approximate matches.
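The order of the matching passes above can be sketched in Python as a set of nested loops. This is not the Avenue script DIRM uses. The function `cascade_match` is a hypothetical name, and `match_fn` stands in for whatever matching engine is available, assumed here to take an address, a data set name, and a tolerance and return True on a match. The final step, adding addresses geocoded only by Centrus, is omitted.

```python
def cascade_match(records, sources, match_fn):
    """Try close matches first, then approximate matches.

    records: list of dicts with 'original' and 'standardized' addresses.
    sources: data set names in priority order.
    match_fn: hypothetical callable (address, source, tolerance) -> bool.
    """
    matched = {}
    unmatched = list(range(len(records)))
    for tolerance in ("close", "approximate"):
        # The original address is tried before the standardized one.
        for field in ("original", "standardized"):
            for source in sources:
                remaining = []
                for i in unmatched:
                    if match_fn(records[i][field], source, tolerance):
                        matched[i] = {"accuracy": tolerance,
                                      "source": source,
                                      "address_used": field}
                    else:
                        remaining.append(i)
                # A matched record is never offered to a later data set.
                unmatched = remaining
    return matched, unmatched
```

Recording the tolerance and the data set for each match mirrors the way the process tracks the accuracy and source of every geocode.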
Geocoding
9. Inside ArcView, the latitude and longitude are
calculated for each matched address. This estimates
the coordinates by averaging along a street segment
and applying a 30' offset from the centerline or using
the centroid of a parcel.
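The estimate along a street segment can be illustrated with a short Python sketch of linear interpolation plus a perpendicular offset. This is an assumption about how such a calculation is commonly done, not a description of ArcView internals. It treats the segment as a straight line in planar coordinates, and the default offset of 30 is in the same units as those coordinates. The function name and parameters are hypothetical.

```python
import math


def interpolate_address(house_number, range_from, range_to,
                        start, end, side, offset=30.0):
    """Estimate a point for a house number along a straight segment."""
    span = range_to - range_from
    # Place the point proportionally along the address range.
    fraction = 0.5 if span == 0 else (house_number - range_from) / span
    dx, dy = end[0] - start[0], end[1] - start[1]
    x = start[0] + fraction * dx
    y = start[1] + fraction * dy
    length = math.hypot(dx, dy)
    if length == 0:
        return (x, y)
    # Unit vector pointing to the left of the start-to-end direction.
    nx, ny = -dy / length, dx / length
    sign = 1.0 if side == "left" else -1.0
    return (x + sign * offset * nx, y + sign * offset * ny)
```

For a segment running from (0, 0) to (1000, 0) with an address range of 100 to 200, house number 150 on the left side is placed at (500, 30). A flipped segment would reverse both the position along the street and the side, which is one of the error sources listed under Accuracy.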
Assigning Attributes
10. Each matched address is assigned U.S. Census
attributes and other geographic values. This is
accomplished by comparing the latitude and longitude
to other GIS spatial layers, using a point-in-polygon
operation.
11. Two output files in dBase format are created
containing the matched addresses (with additional
attributes) and the unmatched addresses. See Appendix E
for the file structures.
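The point-in-polygon operation used to assign attributes can be sketched in Python with the standard ray-casting test. GIS software performs this step internally, so the code is only meant to show the idea. The function names and the layer structure (a dictionary from attribute value to a list of polygon vertices) are assumptions made for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_attribute(x, y, layer):
    """Return the attribute of the first polygon containing the point."""
    for value, polygon in layer.items():
        if point_in_polygon(x, y, polygon):
            return value
    return None
```

Points that fall exactly on a boundary are not handled specially in this sketch. As noted under Accuracy, small positional differences near a boundary can place a geocode in the adjacent tract.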
Processing Unmatched Addresses (not automated)
Depending on the data type, intended use, and the
number of unmatched records there are other options for
geocoding.
If there are only a few unmatched records,
interactive matching can be completed using GIS software
like ArcView. If the user does not have GIS software,
the following link provides for simple geocoding through
a Web browser interface:
http://ww4.doh.wa.gov/scripts/esrimap.dll?Name=geoview&Cmd=Map.
The user will need to edit the output files by hand to
add the appropriate attributes.
If there are a large number of unmatched records, the
ZIP code or city name can be used instead of the street
address. If a match is found, the center (or centroid)
of a ZIP code or city is used to calculate the
latitude/longitude. Using this approximate location,
U.S. Census and other geographic values are assigned.
These types of matches can be used at the level at which
the match occurs or at larger aggregations, but will not
be accurate for other purposes. Centroid matches are not
included in DOH's standard process, but are used
with selected data sets, such as the Washington State
Cancer Registry.
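The centroid fallback could look something like the Python sketch below. The function name, the field names, and the two lookup tables are hypothetical, and trying the ZIP code before the city name is an assumption of this sketch. Setting the score to 0 follows the statement under Accuracy that a score of 0 indicates a centroid match.

```python
def centroid_fallback(record, zip_centroids, city_centroids):
    """Assign a ZIP code or city centroid to an unmatched record."""
    zip_code = record.get("zip", "")[:5]
    city = record.get("city", "").strip().upper()
    if zip_code in zip_centroids:
        lon, lat = zip_centroids[zip_code]
        level = "ZIP centroid"
    elif city in city_centroids:
        lon, lat = city_centroids[city]
        level = "City centroid"
    else:
        return None
    # Score 0 lets data users tell centroid geocodes from street-level ones.
    return {"x_coord": lon, "y_coord": lat,
            "av_score": 0, "match_level": level}
```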
Figure 1 Overview of process

Benefits
- Using this iterative approach on multiple data
sets maximizes the number of matched records.
- This approach provides the ability to customize
the assignment of spatial attributes.
- This approach takes advantage of multiple software
packages, utilizing their strengths.
- The output dBase file includes fields to identify
the accuracy and source of the matches.
- Additional geocoding software can be used in an
attempt to match the unmatched records.
- This approach utilizes existing GIS software
maintained and supported by DOH.
- The ESRI shapefile of points representing the
matched addresses can be viewed with many GIS software
packages.
- The ArcView portion of this process is automated
using the Avenue scripting language.
- This approach also provides the ability to add
additional street data sets as they become available.
Appendix A - Definition of Terms
Approximate match is meant to represent acceptable
address matches. This level of matching allows for
slightly misspelled street names or missing street types
or directional information. Normally these matches are
considered tolerable because of the nature of data input
techniques. These matches are widely used for all types
of geocoding projects. Tolerances lower than this will
make matches when no address range is present on the
street or when the street is named 118th and a match is
made with 8th. These matches may be sufficient for
countywide analysis but should not be used for most
types of projects, and are therefore not included in
this standard process.
Assign spatial attributes involves first
geocoding an address then comparing its location to
another GIS spatial layer. These layers most often
contain polygon or area features (e.g. census block
groups, city limits).
Attribute: Information related to a map
feature (e.g. census demographics pertaining to census
tract).
Centroids are points inside a polygon area,
usually the center.
Close match is intended to represent addresses
that match a given street segment, using the street
name, house number, and ZIP code information. The
geocoding process automatically parses the input address
and attempts some limited standardization before the
matching is attempted. These matches are the most
accurate possible.
Street segment is a portion of a street
centerline in a linear GIS spatial layer. Streets are
often divided up into these segments to incorporate
changing address ranges, ZIP codes or other attribute
changes.
CASS (Coding Accuracy Support
System) is a system the U.S. Postal Service uses to
evaluate the accuracy of address-matching software
programs. By being CASS certified, bulk mailing rates
may be applied to the standardized addresses.
TIGER (Topologically
Integrated Geographic Encoding and Referencing system)
refers to the system and data format the U.S. Census
Bureau uses to display geography.
Appendix B - Address Standardization
These are some examples of the address standardizing
Centrus provides.
Input Address | Standardized Address | Change
131 Elm, 98501 | 131 Elm ST E, 98501 | Adds street type and direction
200 Conger Ave, 98502 | 200 Conger ST NW, 98502 | Changes street type and adds a direction
1437 MLK WY, 98265 | 1437 Martin Luther King Jr. Way, 98265 | Replaces abbreviations
601 Ryan RD, 98502 | 601 Ryan RD, 98512 | Updates the ZIP code if necessary
800 Lakeridge Dr 27, 98503 | 800 Lakeridge Dr TRLR 27, 98503 | Updates the type of unit number
400 Renton Ave NE, Renton, WA 98356 | 400 Renton Ave NE, New Castle, WA 98356 | Updates the city name
333 Hanovor Lane, 98437 | 333 E Hanover LN, 98437 | Corrects the street spelling
Appendix C - Local Government Data Used
This map highlights the County Governments that DOH
has contacted regarding the use of GIS addressing data.
These data are in the form of street centerline files with
address ranges, or parcel ownership points that contain
the site address.

Appendix D - Address Matching Accuracy
Input Address | Street Segment Attributes in ArcView | Av_score
Typical Close Matches
1490 LK DR. | 1466-1574 Lake Dr | 100
3706 Shoshone Dr | 3700-3798 Shoshone Dr | 100
1301 N Highlands Pkwy | 1301-1399 N Highlands Pky | 100
1301 Highlands Pkway | 1301-1399 N Highlands Pky | 100
3017 Lombard Ave Apt 809 | 3001-3099 Lombard Ave | 100
Typical Approximate Matches
3110 Camp Road 2 | 3108-3112 Camp 2 Rd | 95
3640 Old Hwy 99 N | 3620-3680 Old 99 N | 95
281 Dungeness Meadows | 200-300 Dungeness Meadows | 92
1338 Bellefied Pk Ln | 1100-1373 Bellefield Park Ln | 91
9531 Forest Del Dr | 9400-9600 Forest Dell Dr | 90
1130 Fairmount Ave | 1100-1198 Fairmont Ave | 89
1690 80th Street KP | 1660-1700 80th K P St S | 88
1258 Weilan St | 1200-1298 Weiland St | 87
9218 Spearl Pl S | 9200-9248 Spear Pl S | 86
10326 18th Ave SW | 10300-10398 185th Ave SW | 85
3412 Undie Rd | 3386-3502 Undi Rd | 85
1919 Layfayette Rd | 1939-1951 Lafayette Rd | 84
821 Port Susan Terrace Rd | 801-849 Port Susan Ter Rd | 83
2800 Erlands Pt Rd NW #44 | 2700-2898 NW Erlands Point Rd | 83
12329 55th Pl W | 12101-12399 5th Pl W | 82
4226 Wescott Dr | 4100-4399 Westcott Dr | 81
1521 Hwy 101 W Sp#29 | 1507-1531 USHY 101 | 80
5720 Blvd Ext Rd Se | 5312-5898 Boulevard Rd Se | 79
4450 Abelin Ct S #81 | 4400-4448 Abelia Ct S | 78
219 N Broadway St | 201-299 S Broadway St | 77
800 17th St Pl Nw | 700-802 17th STPL | 75
Appendix E - File Structures
Field Name | Type | Width | Decimals | Example | Description
Input
Street | Char | 40 | | 1060 S MAIN #47 | Input street or mailing address
City | Char | 20 | | COLVILLE | Input city name (not required)
ZIP | Char | 10 | | 99114 | Input 5 digit ZIP code
Output (Centrus standardization for all records)
N_address | Char | 40 | | 1060 S MAIN ST TRLR 47 | Standardized address
N_city | Char | 30 | | COLVILLE | Standardized city name
N_ZIP | Char | 10 | | 99114 | Standardized ZIP code (may not match input ZIP)
N_housenum | Char | 6 | | 1060 | House number
N_street | Char | 30 | | MAIN | Standardized street name
N_strsuf | Char | 6 | | ST | Standardized street name suffix
N_predir | Char | 6 | | S | Standardized street name prefix direction
N_postdir | Char | 6 | | E | Standardized street name suffix direction
N_unit | Char | 6 | | 47 | Standardized unit number
N_unitdes | Char | 6 | | TRLR | Standardized unit designation
Output (Matched records only)
Accuracy | Char | 20 | | Close | Type of match, "Close" or "Approximate"
Av_score | Num | 3 | 0 | 100 | Match score (100="Close", 70-99="Approx.")
Source | Char | 20 | | TIGER 2000 | Name of the street data set used to geocode
Av_city | Char | 20 | | Colville | City name if inside city limits
Tract90 | Char | 6 | | 950500 | Census 1990 tract number
Tract90d | Char | 8 | | 9505.003 | Census 1990 decimal format tract/block group
Bg90 | Num | 1 | 0 | 3 | Census 1990 block group number
Block90 | Char | 4 | | 13B | Census 1990 block number
Av_co | Char | 3 | | 065 | Census 1990 County FIPS Code (001-077)
Tract00 | Char | 6 | | 950500 | Census 2000 tract number
Tract00d | Char | 8 | | 9505.003 | Census 2000 decimal format tract/block group
Bg00 | Num | 1 | 0 | 3 | Census 2000 block group number
Block00 | Char | 4 | | 3745 | Census 2000 block number
Zcta | Char | 5 | | 98502 | Census 2000 ZIP Code Tabulation Area
Av_ZIP | Char | 5 | | 99114 | Geocoded ZIP code (may not match input ZIP)
Av_alpha | Char | 2 | | 33 | Alphabetical County ID (01-39)
Av_date | Char | 30 | | Tues Jan21 15:44:00 2003 | Date geocoded
X_coord | Num | 15 | 5 | -117.9051 | Longitude of address geocode
Y_coord | Num | 15 | 5 | 48.53513 | Latitude of address geocode