Guidelines for Address Matching and Geocoding

Revision Date: December 19, 2008
(on Data Guidelines main page, http://www.doh.wa.gov/Data/Guidelines/guidelines.htm,
"Address Matching and Geocoding Data" linked to http://ww4.doh.wa.gov/gis/geocoding_guideline.htm;
see Juliet's email dated 1/7/2009, "FW: Geocoding Guideline")

Background
Address Standardization
Address Matching
Geocoding
The Importance of Geocoding
Geocoding Software
Street Centerline Data
Accuracy
Using Geocoded Data
Current geocoding process at the DOH
Appendix A - Definition of Terms
Appendix B - Address Standardization
Appendix C - Local Government Data Used
Appendix D - Address Matching Accuracy
Appendix E - File Structures

Guidelines For Address Matching and Geocoding (Word Document)

Background

The intent of this guideline is to be a technical reference for geocoding street addresses, and to provide some background on the technologies used by the Washington State Department of Health (DOH). The DOH Division of Information Resource Management (DIRM) currently provides address standardization and geocoding services. These services are available to all divisions of DOH as well as other State agencies, Local Health Jurisdictions and other health-related agencies. DIRM attempts to provide the highest possible number and quality of street-level address matches. To this end, DOH has entered into data sharing agreements with many Washington State counties to share accurate street and parcel ownership data. Combining these data with commercially available data allows DIRM to maximize the number and quality of matches. Appendix A provides definitions of technical terms used in this guideline.

Address Standardization

Address standardization takes a street address and ZIP code and attempts to correct misspellings and changes in ZIP codes. The address is parsed into standard pieces: the house number, street name, direction prefix, direction suffix, and street type. Once these parts of the address have been identified, their values are standardized (e.g. AV becomes AVE, LP becomes Loop). Currently we use the Centrus software from Group 1 Software, Inc. (http://www.centrus.com). The data used by Centrus are proprietary and come from both the USPS and Geographic Data Technologies. The data are updated quarterly, and the standardized addresses are CASS certified by the U.S. Postal Service (USPS) for bulk mailing rates. The Centrus software compares addresses to a USPS national database. This step is critical for increasing geocoding match rates. (See Appendix B for examples.)
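
The parsing and standardization step can be illustrated with a minimal sketch. This is not how Centrus works internally; the lookup tables and the function name parse_address are assumptions made for illustration, and real CASS-certified software relies on USPS reference data rather than a short hard-coded list.

```python
import re

# Hypothetical lookup tables for illustration only.
STREET_TYPES = {
    "AV": "AVE", "AVE": "AVE", "AVENUE": "AVE",
    "LP": "LOOP", "LOOP": "LOOP",
    "ST": "ST", "STREET": "ST",
    "RD": "RD", "ROAD": "RD",
    "DR": "DR", "DRIVE": "DR",
    "LN": "LN", "LANE": "LN",
    "WY": "WAY", "WAY": "WAY",
}
DIRECTIONS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}


def parse_address(raw):
    """Split a street address into house number, prefix direction,
    street name, street type, and suffix direction."""
    tokens = re.sub(r"[.,]", " ", raw.upper()).split()
    parts = {"housenum": "", "predir": "", "street": "",
             "strsuf": "", "postdir": ""}
    if tokens and tokens[0].isdigit():
        parts["housenum"] = tokens.pop(0)
    # Keep at least one token for the street name itself.
    if len(tokens) > 1 and tokens[0] in DIRECTIONS:
        parts["predir"] = tokens.pop(0)
    if len(tokens) > 1 and tokens[-1] in DIRECTIONS:
        parts["postdir"] = tokens.pop()
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        parts["strsuf"] = STREET_TYPES[tokens.pop()]
    parts["street"] = " ".join(tokens)
    return parts
```

For example, parse_address("200 Conger Av NW") yields house number 200, street CONGER, street type AVE, and suffix direction NW. A sketch like this cannot add a missing street type or correct a ZIP code; those corrections require the reference database described above.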

Address Matching

Address matching is the process of matching the street address and ZIP code in the original dataset to another address and ZIP code. Typically, the second address and ZIP code represent street centerlines or ownership parcels. The street centerlines can have address ranges and ZIP codes assigned to each side of the street. The ownership parcels have a single address and ZIP code assigned to a point.
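
As a rough illustration of matching against street centerlines, the sketch below checks a parsed address against segments that carry an address range and ZIP code for each side of the street. The Segment fields and the odd/even parity rule are assumptions for the example, not the structure of any data set DIRM uses.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical street centerline segment with per-side ranges."""
    name: str
    left_from: int
    left_to: int
    right_from: int
    right_to: int
    left_zip: str
    right_zip: str


def match_segment(housenum, street, zip_code, segments):
    """Return (segment, side) for the first segment whose street name,
    ZIP code, address range, and odd/even parity fit the address,
    or None if no segment fits."""
    for seg in segments:
        if seg.name != street:
            continue
        sides = (
            ("L", seg.left_from, seg.left_to, seg.left_zip),
            ("R", seg.right_from, seg.right_to, seg.right_zip),
        )
        for side, start, end, side_zip in sides:
            low, high = min(start, end), max(start, end)
            in_range = low <= housenum <= high
            # Assumption: each side holds only odd or only even numbers.
            same_parity = housenum % 2 == start % 2
            if side_zip == zip_code and in_range and same_parity:
                return seg, side
    return None
```

A parcel match is simpler: the parcel carries a single site address, so the comparison is a direct lookup rather than a range test.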

Geocoding

There are three main types of geocoding functions. The first type assigns latitude and longitude to a street address that has been matched to a street centerline or ownership parcel. This is the only type of geocoding covered in this guideline. These addresses can then be displayed as points on a map, or aggregated to larger areas (e.g. city limits, wellhead protection areas, school districts). For example, this type of geocoding can be used to show points on a map for all the addresses in the Washington State Cancer Registry. CAUTION: In general, the latitude and longitude at which a health event occurred are confidential information. Just as publishing someone's address is most often a violation of confidentiality, data users need to be sensitive to the scale at which they display dots representing health events on maps. Before disseminating such maps they need to be sure that this method of visualizing data does not violate confidentiality.

The second type of geocoding is used for data without a street address. If the data in the original dataset has a geographic reference (e.g. ZIP code, county, U.S. Census tract) it can be geocoded to those geographic features. The data can be displayed as counts in graduated colors on a map. For example, survey results that contain only ZIP codes can be shown on a map, by the number of results in each ZIP code.
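
A minimal sketch of this second type, assuming each record is a dictionary with a "zip" key (a hypothetical layout): the records are simply counted per geographic unit, and the counts can then be joined to ZIP code polygons for display in graduated colors.

```python
from collections import Counter


def counts_by_area(records, key="zip"):
    """Count records per geographic reference (ZIP code by default),
    skipping records where the reference is missing or empty."""
    return Counter(r[key] for r in records if r.get(key))
```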

The third type of geocoding is used for data without a street address or a specific geographic reference. This requires a common link between the data in a given data set and an existing geographic feature. For example, a data set that contains a hospital name and bed capacity can be shown at the hospital locations on a map. This is accomplished by linking the hospital name with previously geocoded hospitals that also contain the name.
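
The linking step can be sketched as a join on a shared name field. The field names ("name", "lon", "lat") are assumptions for illustration, and the name comparison here only ignores case and surrounding spaces; real facility names often need more careful cleaning before they will link.

```python
def link_by_name(records, geocoded, key="name"):
    """Attach coordinates from previously geocoded features to records
    that share the same name. Returns (linked, unlinked)."""
    lookup = {g[key].strip().upper(): (g["lon"], g["lat"]) for g in geocoded}
    linked, unlinked = [], []
    for rec in records:
        coords = lookup.get(rec[key].strip().upper())
        if coords is None:
            unlinked.append(rec)
        else:
            linked.append({**rec, "lon": coords[0], "lat": coords[1]})
    return linked, unlinked
```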

The Importance of Geocoding

The majority of data that DOH uses has an address or other geographic reference. It has also been estimated that over 90% of corporate America's data has some kind of geographic reference. Geocoding allows DOH to use these data to display health-related information on maps and to conduct geospatial analyses to determine whether there are geographic patterns in rates of health-related events. For example:

  • determining rates of health-related events by county may require geocoding;
  • investigating disease outbreaks and potential clusters requires accurately geocoded locations; and
  • geocoding to larger areas, like ZIP codes, allows sensitive data to be displayed while maintaining confidentiality.

A partial listing of geocoded data used at the DOH includes vital records, cancer registry data, daycare facilities, cases of sexually transmitted disease, tobacco retailers, schools, hospitals, pharmacies, hazardous waste sites, and drug labs.

Geocoding Software

DIRM staff evaluated five software packages for accuracy and overall match rates: ArcView, Centrus, MapMarker, GeoVista and Maptitude. After detailed benchmarking in 2000, it was still unclear which software performed the best. While some software, such as Centrus, provided address standardization that improved the match rates, the quality of the underlying data seemed to make the most difference. DIRM decided to use the combination of Centrus and ArcView GIS.

Street Centerline Data

DIRM staff evaluated a variety of street centerline data: U.S. Census TIGER (1992 through 2000 releases), ESRI Streetmap, GDT Dynamap 2000 and Navigation Technologies. These data sets were provided by the U.S. Census or purchased from commercial vendors. While no data set was complete, Navigation Technologies was the most accurate and complete for the entire state. Local level street data were also evaluated. The overall accuracy of street data obtained from counties and cities is higher than that of the statewide data sets. Appendix C shows the counties in Washington for which we have digital data available for streets or parcel centroids. Since no data set was complete, DIRM recommends using more than one source.

Accuracy

The process of address matching and geocoding involves many variables that affect the accuracy of the results. Below is a partial list of some of the potential inaccuracies.

  • The input address or ZIP code is incorrect.
  • The address standardization software incorrectly parses the address or ZIP code.
  • The street centerline attribute data may be incorrect for the address range, street name or ZIP code.
  • A street may be "flipped" so the address is placed on the wrong side or at the opposite end of the street. This can place a geocode in the adjacent U.S. Census tract or even county.
  • The various street and parcel data files do not exactly overlay with U.S. Census tracts. The boundaries of the tracts are based on TIGER streets. Latitude and longitude may be more positionally accurate than the TIGER data resulting in tract assignments that are incorrect.

Each successful geocode generates a match score (called "Av_score") that reflects the accuracy of the match. For street and parcel matches, scores range from 70 to 100. A score of 100 indicates that, after the geocoding software parsed the address, a street or parcel was found where everything matched. A score of 0 indicates a centroid match or an unmatched address. Appendix D contains some examples of address matches and the assigned scores.

CAUTION: Rates that are based on geocoded data can change significantly over time. For example, rates based on data geocoded in 2000 could differ from rates if the same data were geocoded in 2003. As data and technology improve, both the number and accuracy of matched records is expected to increase, and this might affect rate calculations. Thus, it is important to assess the proportion of geocoded records and the accuracy of the matches when interpreting rates or other statistics based on geocoded data. (See Using Geocoded Data.)

Using Geocoded Data

In order to use the geocoded data, especially at relatively small geographies such as the sub-county level, there must be a way to evaluate the accuracy of each geocode. At a minimum, the "Av_score" field in the output file should always accompany the output data. For example, when there are no street centerline or parcel matches for the street addresses in the Washington State Cancer Registry, DIRM uses the 5-digit ZIP code or city name to assign addresses to the centroid of a ZIP code, city, or populated place. This process maximizes the number of records that can be assigned to a county and is useful for county level rates and reports. The user can use the "Av_score" to know which records were geocoded using centroids and which were matched at the street level. These centroid geocodes may not be appropriate for small area analysis like cluster investigations or census tract level analysis. (See Processing Unmatched Addresses.)
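
One way to act on this advice is to classify each record by its "Av_score" and report the proportion at each level before computing rates. The sketch below assumes the score meanings given in this guideline (100 for close, 70-99 for approximate, 0 for centroid or unmatched); the function names and the "undefined" label for any other value are assumptions for illustration.

```python
def match_level(av_score):
    """Translate an Av_score into a match level."""
    if av_score == 100:
        return "close"
    if 70 <= av_score <= 99:
        return "approximate"
    if av_score == 0:
        return "centroid_or_unmatched"
    return "undefined"  # the guideline does not describe other values


def summarize(records):
    """Return the proportion of records at each match level."""
    total = len(records)
    if total == 0:
        return {}
    counts = {}
    for rec in records:
        level = match_level(rec["Av_score"])
        counts[level] = counts.get(level, 0) + 1
    return {level: n / total for level, n in counts.items()}
```

For small area analysis, a user might keep only the records whose level is "close" or "approximate" and report how many records were set aside.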

Current geocoding process at the DOH

DIRM uses the Centrus software to perform address standardization and ArcView software to perform the geocoding and the assignment of spatial attributes. This process is automated using the Avenue scripting language inside ArcView, which allows the use of multiple street and parcel datasets. The accuracy and source of the geocodes are also tracked. See Figure 1 for an overview of this process.

Address Standardization

  1. Address data are provided to DIRM in a digital format (i.e. Access, ASCII, dBase).
  2. The addresses are standardized using the Centrus software to fix misspellings and ZIP code errors. (See Appendix B.) Centrus also attempts to geocode the addresses; these geocodes are used as approximate matches in Step 8 below.

Address Matching

  3. Inside ArcView, the tolerances are set to accept only close matches.
  4. The original addresses are matched to street centerlines using the following data sets. Once a match is made, the address is not used for the next data set.
       • Local Government streets or parcel databases (see Appendix C)
       • NAVTEQ GPS Streets, Navigation Technologies
       • Streetmap 1000, Environmental Systems Research Institute (ESRI)
       • TIGER 2000, U.S. Census Bureau
       • TIGER 1998, U.S. Census Bureau
       • TIGER 1998, U.S. Census Bureau (up to 10 additional address ranges per street segment)
       • TIGER 1995, U.S. Census Bureau
       • TIGER 1992, U.S. Census Bureau
  5. For records that are not matched in Step 4, Step 4 is repeated using the standardized addresses. Tolerances continue to be set to close. This is done after Step 4 because we first want to use the original address exactly as it was entered.
  6. Inside ArcView, the matching tolerances are set to accept "approximate" matches only.
  7. Steps 4 and 5 are repeated for records not matched in Steps 3-5.
  8. If Centrus geocoded any addresses that ArcView did not, they are included as approximate matches.
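
The matching passes described above (close tolerance first, then approximate, each tried on the original and then the standardized address, across the data sets in order) can be sketched as a cascade. This is not the Avenue script DIRM uses; the record keys, the matcher callback, and the function name are assumptions for illustration, and the final Centrus fallback is left out.

```python
def cascade_match(records, sources, matcher):
    """Run each record through the matching passes until it matches.

    records: dicts with "id", "original" and "standardized" addresses.
    sources: ordered list of (source_name, dataset) pairs.
    matcher: callable(address, dataset, tolerance) returning a match
             score, or None when there is no acceptable match.
    Returns (results keyed by record id, list of unmatched records).
    """
    passes = [
        ("close", "original"),
        ("close", "standardized"),
        ("approximate", "original"),
        ("approximate", "standardized"),
    ]
    results = {}
    for tolerance, field in passes:
        for source_name, dataset in sources:
            for rec in records:
                if rec["id"] in results:
                    continue  # already matched; skip later data sets
                score = matcher(rec[field], dataset, tolerance)
                if score is not None:
                    results[rec["id"]] = {
                        "Accuracy": tolerance,
                        "Source": source_name,
                        "Av_score": score,
                    }
    unmatched = [rec for rec in records if rec["id"] not in results]
    return results, unmatched
```

Recording the tolerance and the source name with each match is what lets the output file report the accuracy and origin of every geocode.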

Geocoding

  9. Inside ArcView, the latitude and longitude are calculated for each matched address. This estimates the coordinates by averaging along a street segment and applying a 30' offset from the centerline, or by using the centroid of a parcel.
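
The street segment estimate can be sketched as a linear interpolation followed by a perpendicular offset. The sketch assumes planar coordinates measured in feet and a single straight segment; a latitude/longitude result would need a projection step that is not shown, and the function name and parameters are hypothetical rather than ArcView's own.

```python
import math


def interpolate_on_segment(housenum, range_from, range_to,
                           start, end, offset=30.0, side="R"):
    """Estimate an address location along a straight street segment.

    start, end: (x, y) endpoints of the centerline in planar feet.
    The point is placed proportionally along the address range, then
    moved `offset` feet to the right ("R") or left ("L") of the
    direction of travel from start to end.
    """
    span = range_to - range_from
    # With a single-number range, fall back to the segment midpoint.
    t = 0.5 if span == 0 else (housenum - range_from) / span
    dx, dy = end[0] - start[0], end[1] - start[1]
    x = start[0] + t * dx
    y = start[1] + t * dy
    length = math.hypot(dx, dy)
    if length == 0:
        return (x, y)
    # Unit normal pointing to the right of travel is (dy, -dx) / length.
    sign = 1.0 if side == "R" else -1.0
    nx, ny = sign * dy / length, -sign * dx / length
    return (x + offset * nx, y + offset * ny)
```

For a segment running east from (0, 0) to (1000, 0) with range 100-200, house number 150 on the right side lands at (500, -30). This also shows why a "flipped" street matters: reversing the start and end points moves the point to the opposite side and the opposite end of the segment.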

Assigning Attributes

  10. Each matched address is assigned U.S. Census attributes and other geographic values. This is accomplished by comparing the latitude and longitude to other GIS spatial layers, using a point-in-polygon operation.
  11. Two output files in dBase format are created containing the matched addresses (with additional attributes) and the unmatched addresses. See Appendix E for the file structures.
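
The point-in-polygon operation used to assign attributes can be illustrated with the standard ray-casting test. This is a generic sketch, not ArcView's implementation; it assumes simple polygons given as lists of (x, y) vertices and does not treat points lying exactly on a boundary specially. The layer structure in assign_attribute is hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a horizontal ray
    from (x, y) toward +x crosses; an odd count means inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's y value can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_attribute(x, y, layer):
    """Return the attribute value of the first polygon containing the
    point. `layer` is a list of (value, polygon) pairs."""
    for value, polygon in layer:
        if point_in_polygon(x, y, polygon):
            return value
    return None
```

Running assign_attribute once per layer (census tracts, city limits, and so on) gives each geocoded address one value per layer.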

Processing Unmatched Addresses (not automated)

Depending on the data type, intended use, and the number of unmatched records there are other options for geocoding.

If there are only a few unmatched records, interactive matching can be completed using GIS software like ArcView. If the user does not have GIS software, the following link provides for simple geocoding through a Web browser interface: http://ww4.doh.wa.gov/scripts/esrimap.dll?Name=geoview&Cmd=Map. The user will need to edit the output files by hand to add the appropriate attributes.

If there are a large number of unmatched records, the ZIP code or city name can be used instead of the street address. If a match is found, the center (or centroid) of a ZIP code or city is used to calculate the latitude/longitude. Using this approximate location, U.S. Census and other geographic values are assigned. These types of matches can be used at the level at which the match occurs or at larger aggregations, but will not be accurate for other purposes. Centroid matches are not included in the DOH's standard process, but are used with selected data sets, such as the Washington State Cancer Registry.
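
A centroid fallback of this kind might look like the sketch below. The lookup tables, the "match_level" key, and the choice to try the ZIP code before the city name are assumptions for illustration; the Av_score of 0 follows the convention described under Accuracy.

```python
def centroid_fallback(record, zip_centroids, city_centroids):
    """Return a centroid geocode for an unmatched record, or None.

    zip_centroids: dict mapping 5-digit ZIP code to (lon, lat).
    city_centroids: dict mapping upper-case city name to (lon, lat).
    """
    zip5 = (record.get("ZIP") or "")[:5]
    city = (record.get("City") or "").strip().upper()
    if zip5 in zip_centroids:
        lon, lat = zip_centroids[zip5]
        level = "ZIP centroid"
    elif city in city_centroids:
        lon, lat = city_centroids[city]
        level = "city centroid"
    else:
        return None
    # A score of 0 marks the record as a centroid match.
    return {"X_coord": lon, "Y_coord": lat,
            "Av_score": 0, "match_level": level}
```

Keeping the match level with each record lets later users exclude centroid geocodes from analyses below the ZIP code or city level.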

Figure 1 Overview of process

Benefits

  • Using this iterative approach on multiple data sets maximizes the number of matched records.
  • This approach provides the ability to customize the assignment of spatial attributes.
  • This approach takes advantage of multiple software packages, utilizing their strengths.
  • The output dBase file includes fields to identify the accuracy and source of the matches.
  • Additional geocoding software can be used in an attempt to match the unmatched records.
  • This approach utilizes existing GIS software maintained and supported by DOH.
  • The ESRI shapefile of points representing the matched addresses can be viewed with many GIS software packages.
  • The ArcView portion of this process is automated using the Avenue scripting language.
  • This approach also provides the ability to add additional street data sets as they become available.

Appendix A - Definition of Terms

Approximate match is meant to represent acceptable address matches. This level of matching allows for slightly misspelled street names or missing street types or directional information. Normally these matches are considered tolerable because of the nature of data input techniques. These matches are widely used for all types of geocoding projects. Tolerances lower than this will make matches when no address range is present on the street or when the street is named 118th and a match is made with 8th. These matches may be sufficient for countywide analysis but should not be used for most types of projects, and are therefore not included in this standard process.

Assign spatial attributes involves first geocoding an address then comparing its location to another GIS spatial layer. These layers most often contain polygon or area features (e.g. census block groups, city limits).

Attribute: Information related to a map feature (e.g. census demographics pertaining to census tract).

Centroids are points inside a polygon area, usually the center.

Close match is intended to represent addresses that match a given street segment, using the street name, house number, and ZIP code information. The geocoding process automatically parses the input address and attempts some limited standardization before the matching is attempted. These matches are the most accurate possible.

Street segment is a portion of a street centerline in a linear GIS spatial layer. Streets are often divided up into these segments to incorporate changing address ranges, ZIP codes or other attribute changes.

CASS (Coding Accuracy Support System) is a system the U.S. Postal Service uses to evaluate the accuracy of address-matching software programs. By being CASS certified, bulk mailing rates may be applied to the standardized addresses.

TIGER (Topologically Integrated Geographic Encoding and Referencing system) refers to the system and data format the U.S. Census Bureau uses to display geography.

Appendix B - Address Standardization

These are some examples of the address standardizing Centrus provides.

Input Address	Standardized Address	Change
131 Elm, 98501	131 Elm ST E, 98501	Adds street type and direction
200 Conger Ave, 98502	200 Conger ST NW, 98502	Changes street type and adds a direction
1437 MLK WY, 98265	1437 Martin Luther King Jr. Way, 98265	Replaces abbreviations
601 Ryan RD, 98502	601 Ryan RD, 98512	Updates the ZIP code if necessary
800 Lakeridge Dr 27, 98503	800 Lakeridge Dr TRLR 27, 98503	Updates the type of unit number
400 Renton Ave NE, Renton, WA 98356	400 Renton Ave NE, New Castle, WA 98356	Updates the city name
333 Hanovor Lane, 98437	333 E Hanover LN, 98437	Corrects the street spelling

Appendix C - Local Government Data Used

This map highlights the County Governments that DOH has contacted regarding the use of GIS addressing data. This data is in the form of street centerline files with address ranges, or parcel ownership points that contain the site address.

Appendix D - Address Matching Accuracy

Input Address			Street Segment Attributes in ArcView	Av_score
Typical Close Matches
1490 LK DR.			1466-1574	Lake Dr			100
3706 Shoshone Dr		3700-3798	Shoshone Dr		100
1301 N Highlands Pkwy 		1301-1399	N Highlands Pky		100
1301 Highlands Pkway		1301-1399	N Highlands Pky		100
3017 Lombard Ave Apt 809	3001-3099	Lombard Ave		100	
Typical Approximate Matches
3110 Camp Road 2		3108-3112	Camp 2 Rd		95
3640 Old Hwy 99 N		3620-3680	Old 99 N		95
281 Dungeness Meadows		200-300		Dungeness Meadows	92
1338 Bellefied Pk Ln		1100-1373	Bellefield Park Ln	91
9531 Forest Del Dr		9400-9600	Forest Dell Dr		90
1130 Fairmount Ave		1100-1198	Fairmont Ave		89
1690 80th Street KP		1660-1700	80th K P St S		88
1258 Weilan St			1200-1298	Weiland St		87
9218 Spearl Pl S		9200-9248	Spear Pl S		86
10326 18th Ave SW		10300-10398	185th Ave SW		85
3412 Undie Rd			3386-3502	Undi Rd			85
1919 Layfayette Rd		1939-1951	Lafayette Rd		84
821 Port Susan Terrace Rd	801-849		Port Susan Ter Rd	83
2800 Erlands Pt Rd NW #44	2700-2898	NW Erlands Point Rd	83
12329 55th Pl W			12101-12399	5th Pl W		82
4226 Wescott Dr			4100-4399	Westcott Dr		81
1521 Hwy 101 W Sp#29		1507-1531	USHY 101		80
5720 Blvd Ext Rd Se		5312-5898	Boulevard Rd Se		79
4450 Abelin Ct S #81		4400-4448	Abelia Ct S		78
219 N Broadway St		201-299		S Broadway St		77
800 17th St Pl Nw		700-802		17th STPL		75

Appendix E - File Structures

Field Name	Type	Width	Decimals	Example			Description          

Input
Street		Char	40		1060 S MAIN #47		Input street or mailing address	
City		Char	20		COLVILLE			Input city name (not required)
ZIP		Char	10		99114			Input 5 digit ZIP code

Output (Centrus standardization for all records)	
N_address		Char	40		1060 S MAIN ST TRLR 47	Standardized address
N_city		Char	30		COLVILLE			Standardized city name
N_ZIP		Char	10		99114			Standardized ZIP code (may not match input ZIP)
N_housenum	Char	6		1060			House number
N_street		Char	30		MAIN			Standardized street name
N_strsuf		Char	6		ST			Standardized street name suffix 
N_predir		Char	6		S			Standardized street name prefix direction
N_postdir		Char	6		E			Standardized street name suffix direction
N_unit		Char	6		47			Standardized unit number
N_unitdes		Char	6		TRLR			Standardized unit designation

Output (Matched records only)
Accuracy		Char	20		Close			Type of match, "Close" or "Approximate"
Av_score		Num	3	0	100			Match score (100="Close", 70-99="Approx.")
Source		Char	20		TIGER 2000		Name of the street data set used to geocode
Av_city		Char	20		Colville			City name if inside city limits
Tract90		Char	6		950500			Census 1990 tract number
Tract90d		Char	8		9505.003			Census 1990 Decimal format tract/block group
Bg90		Num	1	0	3			Census 1990 block group number
Block90		Char	4		13B			Census 1990 block number
Av_co		Char	3		065			Census 1990 County FIPS Code (001-077)
Tract00		Char	6		950500			Census 2000 tract number
Tract00d		Char	8		9505.003			Census 2000 Decimal format tract/block group
Bg00		Num	1	0	3			Census 2000 block group number
Block00		Char	4		3745			Census 2000 block number
Zcta		Char	5		98502			Census 2000 ZIP Code Tabulation Area
Av_ZIP		Char	5		99114			Geocoded ZIP code (may not match input ZIP)
Av_alpha		Char	2		33			Alphabetical County ID (01-39)
Av_date		Char	30		Tues Jan21 15:44:00 2003	Date geocoded 
X_coord		Num	15	5	-117.9051			Longitude of address geocode
Y_coord		Num	15	5	48.53513			Latitude of address geocode
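
The decimal tract fields (Tract90d, Tract00d) appear, from the single example in the table, to combine the 6-digit tract number with the block group digit: tract 950500 and block group 3 give 9505.003. The sketch below encodes that reading; it is an inference from one example rather than a documented rule, and the function name is hypothetical.

```python
def tract_decimal(tract, block_group):
    """Build the decimal tract/block group string from a 6-digit tract
    number and a block group digit, following the table's example."""
    if len(tract) != 6 or not tract.isdigit():
        raise ValueError("tract must be a 6-digit string")
    # First four digits, a decimal point, the two-digit suffix,
    # then the block group digit.
    return f"{tract[:4]}.{tract[4:]}{block_group}"
```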


Washington State Department of Health
101 Israel Rd SE, PO Box 47812
Olympia, WA 98504-7812

Last Update : 10/20/2009 11:39 AM
Send inquiries about DOH and its programs to the Health Consumer Assistance Office
Comments or questions regarding this web page? Send email to Ramona Nelson.