Sunday, February 2, 2014

Dvořák or Dvorak? Does it really matter?

One of the more significant issues I've run into in bringing some order to my classical music collection is the lack of consistency in metadata entry. Whether it be the composer, artist, or even name of the piece, standards for how metadata is entered seem to vary wildly from CD to CD, wreaking a fair amount of havoc on my ability to organize, sort, and search my collection. But this problem goes beyond the description of items in music software; it extends to the description of information items across the internet and the digital environment as a whole.

To apply the title question as an example, let's look at the Czech composer Antonín Dvořák. You will probably notice the name is one that is not incredibly friendly to our American keyboards, with acute accents over the "i" in Antonín and the "a" in Dvořák, as well as the háček, or hook, over the "r" of the latter. It only makes sense, then, that despite compromising the integrity of the name, we often prefer to enter it as the anglicized Antonin Dvorak instead. Regardless of which form is more convenient, however, if we're striving for consistency in data entry, which form should we use in the "Composer" field of a program like iTunes, MediaMonkey, or Music Collector? A browse through the metadata that my collection of CDs actually produced reveals an even bigger problem than expected:

Antonín Dvořák
Antonin Dvorak
Dvořák
Dvorak
Dvořák, Antonín 
Dvorak, Antonin
Dvořák, Antonín (1841-1904)

One can imagine how much harder this inconsistency makes it to quickly and efficiently find the particular works I'm looking for when browsing by Composer. Now multiply this by the 20-30 classical composers included in the collection, add the sometimes-odd use of the Composer field in modern music, and you've got a true mess on your hands and a tool that is almost unusable. And this doesn't even consider the similar problems one might encounter in the Title or Artist fields.

The fix for such a problem, in the library world, is known as authority control. Authority control is the reason you can use a library catalog and efficiently find results on a particular topic: subject and author search terms have been pre-defined by organizations such as the Library of Congress in its LC Subject Headings and LC Name Authority File. For this particular case, the LC Name Authority File uses the authorized entry Dvořák, Antonín, 1841-1904, which ensures that any works created by Dvořák are uniformly described and easily searchable.
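To make the idea a little more concrete, here is a minimal sketch in Python of what a tiny, home-made authority file for my Composer field might look like. It is only an illustration under my own assumptions: the lookup table, the `fold` helper, and the `authorized_form` function are names I made up for this example, not part of any library system or music program, and the variant list is simply the seven forms from my own collection. The one real dependency is Python's standard `unicodedata` module, which is used to strip the diacritics before comparing names.

```python
import unicodedata

# Authorized heading as quoted from the LC Name Authority File above.
AUTHORIZED = "Dvořák, Antonín, 1841-1904"

# Variant forms actually found in my collection (hand-entered, hypothetical table).
VARIANTS = [
    "Antonín Dvořák",
    "Antonin Dvorak",
    "Dvořák",
    "Dvorak",
    "Dvořák, Antonín",
    "Dvorak, Antonin",
    "Dvořák, Antonín (1841-1904)",
]


def fold(name: str) -> str:
    """Build a comparison key: no diacritics, case-insensitive, tidy spacing."""
    # NFKD splits "ř" into "r" plus a combining caron, and so on.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold and collapse runs of whitespace (this also trims the ends).
    return " ".join(stripped.casefold().split())


# Every variant (and the heading itself) points at the one authorized form.
AUTHORITY = {fold(v): AUTHORIZED for v in VARIANTS}
AUTHORITY[fold(AUTHORIZED)] = AUTHORIZED


def authorized_form(composer: str) -> str:
    """Return the authorized heading, or the input unchanged if it is unknown."""
    return AUTHORITY.get(fold(composer), composer)


if __name__ == "__main__":
    for entry in VARIANTS:
        print(f"{entry!r} -> {authorized_form(entry)!r}")
```

Note that the folded key is only used for matching; what gets stored and displayed is still the full authorized form, diacritics and all. A name that is not in the table is passed through untouched rather than guessed at, which is about as far as a sketch like this can responsibly go.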

Authority control is an easy fix, in theory, but it also requires the establishment of a person or group to create and maintain the authority files, and the willingness of the affected metadata creators to accept the use of authority records. While it is possible that companies in the same field (i.e., music companies) could agree to the common use of authority records, attempting to similarly gain the allegiance of the millions of us common folk who create metadata on a daily basis is a much more daunting task. I don't see a simple solution to this problem, especially considering the multi-national character of the internet and its associated metadata. It may just be that the online environment is too far gone to hope for any meaningful return to organization and control, if there ever was any. What do you think? Do you have any authority control horror stories?

3 comments:

  1. Unicode plays a role here ... it's largely replaced ASCII in the encoding, representation, and handling of text. Not sure how matching algorithms work at this level of machine representation, though ... I need a bit more education myself!!

    Dr. MacCall

  2. Authority control for special characters is especially important in languages like German, where the meaning of words can change depending on the umlauts.

    For instance, drücken means to press and drucken means to print, and schön means pretty but schon means already. Although most cases are more subtle than the examples I used, they can still have a big impact on the interpretation of a title or description.

    1. I think this is a great point, and it also raises a question as to the responsibility of metadata creators in choosing between accuracy and simplicity in the description process. Should metadata be tailored to the needs/requirements of a specific audience (e.g., Americans who are likely unfamiliar with Czech diacritical marks), or to the worldwide community (data should probably be presented in its native form)?

      One of the great characteristics of authority control is that alternate forms can be linked together with the official form, thus enabling the collocation of both. But is it nevertheless important for information to be presented accurately, even at the risk of people not understanding it?

      Thanks for the comments.
