Compare 2 Lists of Names

Compare two lists of names that arrive in different name formats and different file formats (Excel, txt, csv, pdf …).

  1. Convert all files to CSV, with strings enclosed in double quotes ("")
  2. Convert to a common encoding (usually UTF-8 or Windows-1252)
  3. Parse names into nickname, salutation, first, initials or middle, last, suffix **
  4. Match on last name, then first name, and build an exception list of the names that do not match
  5. Do a fuzzy match on last name and first name, and build another exception list
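
Steps 1 and 2 can be sketched in Python with the standard `csv` module. This is a minimal sketch, not the workflow used in this post: the function name `to_quoted_utf8_csv` is hypothetical, and it assumes the input is already CSV-like text in a known source encoding (Windows-1252 by default).

```python
import csv
import io


def to_quoted_utf8_csv(raw_bytes, source_encoding="cp1252"):
    """Re-encode CSV bytes as UTF-8 with every field wrapped in double quotes."""
    # Decode from the source encoding (assumed known; cp1252 is Windows-1252).
    text = raw_bytes.decode(source_encoding)
    reader = csv.reader(io.StringIO(text, newline=""))

    out = io.StringIO(newline="")
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        # Trim stray whitespace around each field before quoting it.
        writer.writerow([field.strip() for field in row])

    return out.getvalue().encode("utf-8")
```

Quoting every field keeps names that contain commas (for example a "Last, First" value in one column) from being split when the file is read again later.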

I ask people to give me Excel files. I give them a little help on how to do that, but a search on importing and converting data into Excel usually makes it pretty easy. This also usually gets the encoding into UTF-8, although encoding is a separate issue.

I load the data from the output of step 3 into an SQL table and run a number of matches.

`FName` varchar(50) COLLATE latin1_bin NOT NULL DEFAULT '',
`LName` varchar(50) COLLATE latin1_bin NOT NULL DEFAULT '',
`H` ENUM('Y', 'N') DEFAULT 'N',
`P` ENUM('Y', 'N') DEFAULT 'N',
`K` ENUM('Y', 'N') DEFAULT 'N',
`R` ENUM('Y', 'N') DEFAULT 'N',
PRIMARY KEY (`LName`,`FName`)
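
The exact-match pass (step 4) can also be illustrated outside SQL. The following is a minimal Python sketch under stated assumptions: `exact_match` is a hypothetical helper, each name is a `(last, first)` tuple, and the comparison ignores case and surrounding whitespace. That last choice is an assumption of the sketch; the `latin1_bin` collation in the table above compares bytes exactly.

```python
def exact_match(list_a, list_b):
    """Split two name lists into exact matches and per-list exceptions.

    Each name is a (last, first) tuple. Case and surrounding whitespace
    are ignored, which is an assumption of this sketch.
    """
    def key(name):
        last, first = name
        return (last.strip().casefold(), first.strip().casefold())

    # Index the second list by its normalized (last, first) key.
    index_b = {key(name): name for name in list_b}

    matched, only_a = [], []
    for name in list_a:
        k = key(name)
        if k in index_b:
            matched.append((name, index_b.pop(k)))
        else:
            only_a.append(name)      # exception: in list A only

    only_b = list(index_b.values())  # exception: in list B only
    return matched, only_a, only_b
```

The two exception lists are the input for the fuzzy pass in step 5.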

The sample code I looked at for name disambiguation is a classic string comparison algorithm: edit distance computed with dynamic programming.

import sys

import Algorithmia
from numpy import zeros


def apply(input):
    # Load both venue lists from Algorithmia's hosted data store.
    venues = Algorithmia.file("data://Nilojyoti/dblp/dblp_venues.csv").getString()
    wiki_venues = Algorithmia.file("data://Nilojyoti/dblp/wikipedia_venues.csv").getString()

    vlist = venues.split(',\n')
    wlist = wiki_venues.split('\n')
    result_list = {}

    # For each DBLP venue, keep the Wikipedia venue with the smallest edit distance.
    for venue in vlist:
        mindist = sys.maxsize
        wmatch = None
        for wikivenue in wlist:
            # Compare against the text before the separator. The separator is
            # assumed to be ' - '; the published page renders it as an en dash.
            distance = edDistDp(venue, wikivenue.split(' - ')[0])
            if distance < mindist:
                mindist = distance
                wmatch = wikivenue
        # The published sample never stored the match; recording it here is
        # the assumed intent.
        result_list[venue] = wmatch
    return result_list


def edDistDp(x, y):
    """ Calculate edit distance between sequences x and y using
        matrix dynamic programming. Return distance. """
    D = zeros((len(x)+1, len(y)+1), dtype=int)
    # First row and column: distance from the empty prefix.
    D[0, 1:] = range(1, len(y)+1)
    D[1:, 0] = range(1, len(x)+1)
    for i in range(1, len(x)+1):
        for j in range(1, len(y)+1):
            delt = 1 if x[i-1] != y[j-1] else 0
            D[i, j] = min(D[i-1, j-1]+delt, D[i-1, j]+1, D[i, j-1]+1)
    return D[len(x), len(y)]
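
The same edit distance can drive the fuzzy pass in step 5. What follows is a minimal sketch, not the code used in this post: `edit_distance` is a pure-Python variant that keeps only two rows of the matrix, and `fuzzy_candidates` with its `max_distance` threshold is a hypothetical helper that pairs each unmatched `(last, first)` name with its closest counterpart.

```python
def edit_distance(x, y):
    """Levenshtein distance, keeping only two rows of the DP matrix."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j - 1] + cost,   # substitute or keep
                            prev[j] + 1,          # delete from x
                            curr[j - 1] + 1))     # insert into x
        prev = curr
    return prev[-1]


def fuzzy_candidates(exceptions_a, exceptions_b, max_distance=2):
    """Pair each unmatched name in A with its closest name in B.

    Distance is the sum of the last-name and first-name edit distances;
    pairs above max_distance (an arbitrary threshold) are dropped.
    """
    pairs = []
    for last_a, first_a in exceptions_a:
        best = None
        for last_b, first_b in exceptions_b:
            d = (edit_distance(last_a.lower(), last_b.lower())
                 + edit_distance(first_a.lower(), first_b.lower()))
            if best is None or d < best[0]:
                best = (d, (last_b, first_b))
        if best is not None and best[0] <= max_distance:
            pairs.append(((last_a, first_a), best[1], best[0]))
    return pairs
```

Names that still have no candidate within the threshold stay on the exception list for manual review.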

Example name parser output

Array (
    [nickname] =>
    [salutation] => Mr.
    [fname] => Anthony
    [initials] => R
    [lname] => Von Fange
    [suffix] => III
)

**The algorithm:**

We start by splitting the full name into separate words. We then do a dictionary lookup on the first and last words to see if they are a common prefix or suffix. Next, we take the middle portion of the string (everything minus the prefix and suffix) and look at everything except the last word of that string. We then loop through each of those words, concatenating them together to make up the first name. While we're doing that, we watch for any indication of a compound last name. It turns out that almost every compound last name starts with 1 of 16 prefixes (Von, Van, Vere, etc). If we see one of those prefixes, we break out of the first name loop and move on to concatenating the last name. We handle the capitalization issue by checking for camel-case before uppercasing the first letter of each word and lowercasing everything else. I wrote special cases for periods and dashes. We also have a couple of other special cases, like ignoring words in parentheses altogether.

Check examples.php for the test suite and examples of how various name formats are parsed.
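
The parsing steps described above can be sketched in Python. This is not a port of the PHP library: the three dictionaries are small illustrative subsets (only Von, Van and Vere are named in the text), treating a single letter as an initial is an assumption made to reproduce the example output, and nickname extraction and capitalization are left out.

```python
SALUTATIONS = {"mr", "mrs", "ms", "miss", "dr", "prof"}   # illustrative subset
SUFFIXES = {"jr", "sr", "ii", "iii", "iv", "phd", "md"}   # illustrative subset
COMPOUND_PREFIXES = {"von", "van", "vere"}                # the three named in the text


def _norm(word):
    """Lowercase a word and drop trailing periods and commas for lookups."""
    return word.rstrip(".,").lower()


def parse_name(full_name):
    # Split into words, ignoring anything wrapped in parentheses.
    words = [w for w in full_name.split()
             if not (w.startswith("(") and w.endswith(")"))]
    parsed = {"salutation": "", "fname": "", "initials": "",
              "lname": "", "suffix": ""}

    # Dictionary lookup on the first and last words.
    if len(words) > 1 and _norm(words[0]) in SALUTATIONS:
        parsed["salutation"] = words.pop(0)
    if len(words) > 1 and _norm(words[-1]) in SUFFIXES:
        parsed["suffix"] = words.pop()

    # Walk everything except the final word, building the first name
    # until a compound last-name prefix shows up.
    first_parts, initials = [], []
    i = 0
    while i < len(words) - 1:
        word = words[i]
        if _norm(word) in COMPOUND_PREFIXES:
            break
        if len(_norm(word)) == 1 and first_parts:
            initials.append(word)    # assumption: single letter = initial
        else:
            first_parts.append(word)
        i += 1

    parsed["fname"] = " ".join(first_parts)
    parsed["initials"] = " ".join(initials)
    parsed["lname"] = " ".join(words[i:])   # remaining words form the last name
    return parsed
```

Calling `parse_name("Mr. Anthony R Von Fange III")` yields the salutation `Mr.`, first name `Anthony`, initial `R`, last name `Von Fange` and suffix `III`, matching the example output shown earlier.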

**Possible improvements**

* Handle the “Lname, Fname” format
* Separate the parsing of the name from the normalization and capitalization, and make those optional
* Separate the dictionaries from the code to make it easier to do localization
* Add common name libraries to allow for things like gender detection
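
The first improvement in the list above could be approached by reordering the input before parsing. A minimal sketch, assuming a hypothetical helper `reorder_comma_name` and input of the form "Last, First Middle" with an optional ", Suffix". It would misread a natural-order name that carries a comma before its suffix, such as "John Smith, Jr.", so it is only a starting point.

```python
def reorder_comma_name(name):
    """Turn 'Last, First Middle[, Suffix]' into 'First Middle Last [Suffix]'."""
    parts = [p.strip() for p in name.split(",") if p.strip()]
    if len(parts) < 2:
        # No comma: assume the name is already in natural order.
        return name.strip()
    last, given, *rest = parts
    return " ".join([given, last, *rest])
```

For example, `reorder_comma_name("Von Fange, Anthony R")` returns `"Anthony R Von Fange"`, which a natural-order parser can then handle.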

**Same logic, different languages**

* Name Parser in Java
* Name Parser in JavaScript
* Name Parser in CSharp

**Credits & license:**

* Read more about the inspiration for this PHP Name Parser library by Josh Fraser
* Special thanks to Josh Jones, Timothy Wood, Michael Waskosky, Eric Celeste, Josh Houghtelin and monitaure for their contributions. Pull requests are always welcome as long as you don't break the test suite.
* Released under Apache 2.0 license

** Nickname is important to the convention and association industry (a big user of this functionality), as the nickname is often used instead of the first name on badges identifying attendees.
