Data Cleansing Techniques for Redundancies
Dealing with duplicate data requires a strategy for handling inconsistent data. The first step is to standardize addresses with data matching software. Second, use data entry programs that validate field formats, which prevents errors such as a name being entered into a phone number field. The critical task is then to find all records that contain exactly or approximately the same data in one or more fields. Consider the sample below of five records, each containing six fields:
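As a rough illustration of field-format validation at data entry, the sketch below rejects values that do not look like a phone number. The function name `is_valid_phone` and the pattern are assumptions made for this example (North American formats only), not the behavior of any particular data entry product.

```python
import re

# Hypothetical pattern: an optional area code, then 3 digits and 4 digits,
# separated by a space, dot, or hyphen. Real systems may accept more formats.
PHONE_PATTERN = re.compile(r"^(\(\d{3}\)\s?|\d{3}[\s.-])?\d{3}[\s.-]\d{4}$")

def is_valid_phone(value: str) -> bool:
    """Return True if value looks like a phone number, False otherwise."""
    return bool(PHONE_PATTERN.match(value.strip()))

# A name typed into the phone field is rejected before it reaches the file.
for candidate in ["(817) 458 9992", "458-9992", "DAVIS"]:
    print(candidate, is_valid_phone(candidate))
```

A check like this only guards the format of a field; it cannot tell whether a well-formed phone number is the right one.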
   Name   Address           City         St  ZIP®        Phone
   -----  ----------------  -----------  --  ----------  --------------
1  DAVIS  115 E 1ST ST      CLEBURNE     TX  76031-2407  (817) 458 9992
2  DAVIS  1 115 ST EAST     CLEBURNE     TX  76031
3  DAVIS  1 EAST 15TH       CLEBURNE DR  TX              817-458-9992
4  DAVIS  1 E FIFTEENTH ST  CLEBURNE     TX  76031       458-9992
5  DAVIS  ONE EAST 15TH ST  CLEBURNE     TX  76031       817-458-9991
All five records above refer to the same person at the same address, yet no two records are exactly alike. Consider some possible attempts to locate the duplicates in the file:
BROWSE 1: Select records with the same address field. Finds none of the above records.
BROWSE 2: Select records with the same name and same five-digit ZIP. Misses records 1, 3, and 5.
BROWSE 3: Select records with the name “DAVIS”. Misses records 2 and 3 (while probably matching many other DAVIS records at other addresses).
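An exact-match selection such as BROWSE 1 can be sketched as a grouping on the chosen key fields. The helper `find_exact_duplicates` and the dictionary layout are hypothetical names introduced for this illustration; the address values are copied from the sample records.

```python
from collections import defaultdict

def find_exact_duplicates(records, key_fields):
    """Group record ids whose values for key_fields are identical."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        groups[key].append(rec["id"])
    # Keep only keys shared by two or more records.
    return [ids for ids in groups.values() if len(ids) > 1]

# Address field of the five sample records, before standardization.
records = [
    {"id": 1, "address": "115 E 1ST ST"},
    {"id": 2, "address": "1 115 ST EAST"},
    {"id": 3, "address": "1 EAST 15TH"},
    {"id": 4, "address": "1 E FIFTEENTH ST"},
    {"id": 5, "address": "ONE EAST 15TH ST"},
]

print(find_exact_duplicates(records, ["address"]))  # prints []
```

Because every address is spelled differently, an exact comparison on that field returns no groups at all.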
After address correction and field validation, the sample records become:
   Name   Address       City      St  ZIP         Phone
   -----  ------------  --------  --  ----------  ------------
1  DAVIS  115 E 1ST ST  CLEBURNE  TX  76031-2407  817-458-9992
2  DAVIS  115 E 1ST ST  CLEBURNE  TX  76031-2407
3  DAVIS  115 E 1ST ST  CLEBURNE  TX  76031-2407  817-458-9992
4  DAVIS  115 E 1ST ST  CLEBURNE  TX  76031-2407  XXX-458-9992
5  DAVIS  115 E 1ST ST  CLEBURNE  TX  76031-2407  817-458-9992
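The phone column above suggests one simple normalization rule: keep only the digits, then write them in a single format, with a placeholder where the area code is missing. The sketch below is one way to do that; `normalize_phone` is a hypothetical helper, and the address correction itself is assumed to be handled by dedicated address-matching software rather than by code like this.

```python
import re

def normalize_phone(raw: str) -> str:
    """Rewrite a phone number as NNN-NNN-NNNN, or '' if it cannot be read."""
    digits = re.sub(r"\D", "", raw)  # drop everything except digits
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    if len(digits) == 7:
        # Area code unknown: mark it with a placeholder, as in record 4.
        return f"XXX-{digits[:3]}-{digits[3:]}"
    return ""  # empty or unrecognized input

print(normalize_phone("(817) 458 9992"))  # prints 817-458-9992
print(normalize_phone("458-9992"))        # prints XXX-458-9992
```

Normalizing only changes how a number is written; it does not decide whether two different numbers belong to the same person.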
Once the records are standardized, duplicate detection improves greatly and has a much better chance of finding the correct group of duplicates. For example, selecting records with the same address, ZIP, and Soundex name works perfectly on the example above.
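The selection on address, ZIP, and Soundex name can be sketched as follows. This is a minimal illustration using the standard American Soundex rules; the helper names `soundex`, `match_key`, and `group_duplicates` are assumptions made for this example and do not describe how any particular matching product works.

```python
from collections import defaultdict

# Standard American Soundex letter codes; vowels, H, W, and Y have no code.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit

def soundex(name: str) -> str:
    """Return the four-character Soundex code of name, or '' if no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (result + "000")[:4]

def match_key(rec):
    """Key used to decide that two standardized records are duplicates."""
    return (rec["address"], rec["zip"], soundex(rec["name"]))

def group_duplicates(records):
    """Return lists of record ids that share the same match key."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# The five sample records after standardization.
records = [
    {"id": i, "name": "DAVIS", "address": "115 E 1ST ST", "zip": "76031-2407"}
    for i in range(1, 6)
]
print(group_duplicates(records))  # prints [[1, 2, 3, 4, 5]]
```

Using the Soundex code instead of the raw name means a spelling variant such as the hypothetical "DAVIES" would still produce the same key as "DAVIS", since both encode to D120.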
As you start your journey to address redundancies and duplications, Data Ladder is your partner and analytical expert. We can bring simplicity and clarity to an otherwise muddled and complicated project. Have confidence that Data Ladder will help you resolve your data quality issues and measurably improve quality and financial performance. Contact us for more information and to get your free trial.