Parsing Author Names


The vagaries of parsing and handling proper names are legendary (for the general lay of the problem, see this article from the W3C). The problem gets even worse once you add the many ways in which (i) authors format their names for publication and (ii) bibliographic data providers format this name data before we receive it.

Sciveyor is thus forced to make a few assumptions about the data to maximize search results. First, all data in our internal data store is saved in whatever format we receive it from journal authors. Sometimes this means a name has been properly parsed into first, last, middle, prefix, and suffix, and sometimes it is just stored as a plain text name.

When we parse names, then – for instance, when a user tries to search for articles by author – we try various forms of name parsing that help us make use of poor-quality incoming data.

  1. Parse the name to determine what the user is searching for. Our name-parsing step tries to invert comma-separated names (“Last, First”), and takes account of suffixes (Jr., Sr., etc.) as well as what BibTeX traditionally calls the “von-part” (von, van, van der, etc.). It can also detect when users provide a string of initials instead of a name (e.g., “JHQ Doe”). Finally, it is aware of several varieties of “Last, First” ordering, which are converted automatically. (For those interested, we’re essentially implementing the name parsing spec from BibTeX.)
  2. Create a Solr query for (1) just the first and last name, and (2, if available) the first, last, and middle names.
  3. For each set of names, check the following:
    1. If the name is a single initial, search it with a wildcard, so it might match a full name.
    2. If a name is not a single initial, query it both as the full name and as an initial without a wildcard.
    3. If multiple initials in a row are present in any search term, create a new search term which has them collapsed together.
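
As a rough illustration of step 1, here is a minimal Python sketch of a BibTeX-style name splitter. It is not Sciveyor's actual parser: the function names, the short SUFFIXES and VON_PARTS lists, and the rule that a token of two to four capitals is a run of initials are all assumptions made for the example.

```python
import re

# Hypothetical, abbreviated lists; a real parser would need many more entries.
SUFFIXES = {"Jr", "Jr.", "Sr", "Sr.", "II", "III", "IV"}
VON_PARTS = {"von", "van", "der", "den", "de", "la"}


def expand_initials(tokens):
    """Split a run of capitals such as "JHQ" into separate initials."""
    out = []
    for tok in tokens:
        # Assumption: two to four capitals (optionally dotted) are initials.
        if re.fullmatch(r"(?:[A-Z]\.?){2,4}", tok):
            out.extend(re.findall(r"[A-Z]", tok))
        else:
            out.append(tok)
    return out


def parse_name(raw):
    """Return a dict with the first, von, last, and suffix parts of a name."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    suffix = ""
    if len(parts) >= 2:
        # Comma forms: "von Last, First" or "von Last, Jr., First".
        if len(parts) >= 3:
            suffix = parts[1]
        first = parts[-1].split()
        rest = parts[0].split()
    else:
        # Natural order: "First von Last" with an optional trailing suffix.
        tokens = " ".join(parts).split()
        if len(tokens) > 1 and tokens[-1] in SUFFIXES:
            suffix = tokens.pop()
        # The last name starts at the first von-particle, else at the final token.
        start = next((i for i, t in enumerate(tokens) if t in VON_PARTS),
                     len(tokens) - 1)
        first, rest = tokens[:start], tokens[start:]
    # Peel leading von-particles off the last name, keeping at least one token.
    split = 0
    while split < len(rest) - 1 and rest[split] in VON_PARTS:
        split += 1
    return {
        "first": " ".join(expand_initials(first)),
        "von": " ".join(rest[:split]),
        "last": " ".join(rest[split:]),
        "suffix": suffix,
    }


print(parse_name("Doe, John Jay L."))
print(parse_name("JHQ Doe"))
```

Following the BibTeX convention, the sketch only treats lowercase particles as the von-part, so a capitalized “Van” stays with the neighbouring name. The initials rule is a blunt heuristic: a short first name typed in all capitals would be misread as initials.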

This logic is easiest to display in action. Consider a particularly complex case, where the user searches for “Doe, John Jay L.”. The parsing of this name goes like this:

  • Recognize the “Last, First” format.

    first: John Jay L.
    last: Doe
    
  • Construct one query for just “John Doe”.

    • Only one first name, and it’s not a single initial. We thus produce two search queries for this name:

      "John Doe"
      "J Doe"
      
  • Construct one query for “John Jay L. Doe”.

    • This has three first and middle names, and one of them is a single initial. First, we produce every combination of the names that are present, both as full names and as initials (also, note the wildcard, as L was specified by the user as an initial):

      "John Jay L* Doe"
      "John J L* Doe"
      "J Jay L* Doe"
      "J J L* Doe"
      
    • Now, we add to this list the result of combining together multiple runs of initials:

      "John JL* Doe"
      "JJ L* Doe"
      "J JL* Doe"
      "JJL* Doe"
      
  • Finally, combine them all together to get:

    "John Doe"
    "J Doe"
    "John Jay L* Doe"
    "John J L* Doe"
    "John JL* Doe"
    "J Jay L* Doe"
    "J J L* Doe"
    "JJ L* Doe"
    "J JL* Doe"
    "JJL* Doe"
    

Thus we’ve created a total of ten search terms from this particular name. They’ll match everything from “JJL Doe” to “John Jay Lucas Doe”.
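
The expansion in the walk-through above can be sketched in a few lines of Python. This is an illustrative sketch, not the code Sciveyor runs: the function names are hypothetical, and it assumes that an initial carrying a wildcard is only ever glued onto the end of a run of initials, never the start.

```python
from itertools import product


def variants(token):
    """Forms to try for one first or middle name."""
    bare = token.rstrip(".")
    if len(bare) == 1:
        # The user gave only an initial: add a wildcard so it can match a full name.
        return [bare + "*"]
    # A full name is tried both as written and as a bare initial, no wildcard.
    return [token, token[0]]


def collapsed(forms):
    """Return every version of forms with adjacent initials glued together."""
    # A gap can be closed when the left token is a bare initial and the
    # right token is an initial, with or without a wildcard (an assumption).
    gaps = [i for i in range(len(forms) - 1)
            if len(forms[i]) == 1 and len(forms[i + 1].rstrip("*")) == 1]
    results = []
    for choice in product([False, True], repeat=len(gaps)):
        if not any(choice):
            continue  # the uncollapsed form is already in the list
        closed = {g for g, c in zip(gaps, choice) if c}
        text = forms[0]
        for i in range(1, len(forms)):
            text += ("" if i - 1 in closed else " ") + forms[i]
        results.append(text)
    return results


def search_terms(first, last):
    """Build quoted search phrases from a first-name string and a last name."""
    tokens = first.split()
    if not tokens:
        return [f'"{last}"']
    # One query for just the first name, one for first plus middle names.
    name_sets = [tokens[:1]]
    if len(tokens) > 1:
        name_sets.append(tokens)
    terms = []
    for names in name_sets:
        combos = [list(c) for c in product(*map(variants, names))]
        phrases = [" ".join(c) for c in combos]
        for combo in combos:
            phrases.extend(collapsed(combo))
        for phrase in phrases:
            term = f'"{phrase} {last}"'
            if term not in terms:  # skip duplicates, keep first-seen order
                terms.append(term)
    return terms


for term in search_terms("John Jay L.", "Doe"):
    print(term)
```

Called with the parsed parts of “Doe, John Jay L.”, the sketch produces the same ten terms as the walk-through, though the collapsed forms come out in a slightly different order.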

If you can think of some edge cases we’ve missed, we’d love to hear about them! We know, of course, that this approach will fail miserably when it comes to non-Western names in non-Latin scripts. Unfortunately, there’s very little data available for testing that’s in such formats, so we don’t really have anything to go on, and most data from journal publishers is released in romanized or latinized forms (for better and/or worse).