Tuesday, February 25, 2014

Dublin Core: Format

Having officially submitted my ranked list of Dublin Core elements to our professor for assignment, I was more than a little surprised by some of the less obvious ones that managed to squirm their way up toward the top of my list. Date, as I discussed in my last post, ended up being my first choice, but the equally unexpected Format element did not follow far behind. Now, when you hear the word "format" in the Digital Age, specific types of computer files probably come to mind - Word documents, PowerPoint presentations, MP3s, MPEGs, and JPEGs being notable examples we encounter daily. It's a safe bet that a Format element will include this type of technical information, but as I discovered, it also goes beyond that in its task of helping to describe the nature of an information resource.

The Dublin Core Format element gives metadata creators a place to record the specific physical and digital qualities of the information resource being described. For digital items, this includes details such as file formats, programming languages, file sizes, digital dimensions (e.g. resolution), and encoding standards. When describing physical materials, Format would similarly capture information such as physical dimensions, duration, and the specific medium carrying the resource (e.g. audio cassette, 35mm photograph). Like many of the Dublin Core elements, Format also has several refinements, or qualifiers, that enable greater specificity in description, including IMT (Internet Media Type), Extent, and Medium.
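To make the distinction concrete, here is a minimal sketch of how qualified Format values might look for one digital and one physical item, modeled as plain Python dictionaries. The field names follow the qualifiers mentioned above (IMT, Extent, Medium), but the items themselves and the helper function are invented for illustration.

```python
# Hypothetical records: qualified Dublin Core Format values as plain dicts.
digital_photo = {
    "format": "image/jpeg",          # IMT (Internet Media Type)
    "format.extent": "2.4 MB, 3000 x 2000 pixels",
}

audio_cassette = {
    "format.medium": "audio cassette",
    "format.extent": "90 min.",
}

def needs_special_software(record):
    """Rough heuristic: flag media types a web browser can't display natively."""
    imt = record.get("format", "")
    return bool(imt) and not imt.startswith(("image/", "text/"))

print(needs_special_software(digital_photo))  # False: JPEGs open anywhere
```

This is exactly the kind of user-facing decision the Format element supports: the record itself tells you whether you will need extra software before you even retrieve the item.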

While this type of description may seem more mundane than something like the Subject element, it nevertheless plays a vital role in the information it conveys to users. Knowledge of specific file formats or encoding standards, for example, can alert users to the need for certain types of software in order to access the desired information resource. Description of the physical extent or medium of an item can similarly help a user decide whether a specific information object adequately fits the criteria they are looking for. Format is thus a fairly integral aspect of describing both digital and physical materials, and an intriguing possibility to work with.

Sunday, February 23, 2014

Dublin Core: Date

As I go about trying to order the DC elements I would like to work with, it is interesting to see some of the complexity, and problems, of elements that would appear to be fairly straightforward and simple. Take the Date element, for instance...not a terribly difficult item to wrap your head around, right? Just put the copyright date or something? Well, no, it's not quite THAT simple. Dublin Core actually allows several qualifiers to be used in conjunction with Date, such as Created, Valid, Available, Issued, and Modified, in order to refine the type of temporal information that can be entered in a metadata record. Thus, you can use DateIssued for something like the publication year of a work, or DateCreated for when an electronic document was first generated, if you need more specificity than the general Date element can provide.
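As a sketch of why these qualifiers matter, here is an invented record using dotted field names in the style of qualified Dublin Core. With distinct qualifiers, "what date?" stops being ambiguous.

```python
# Hypothetical record: qualified Date elements distinguish different events
# in a resource's life (field names are illustrative, not normative).
record = {
    "title": "Annual Report",        # invented resource
    "date.created": "2013-11-04",    # when the document was first generated
    "date.issued": "2014-01-15",     # when it was formally published
    "date.modified": "2014-02-20",   # most recent revision
}

# Each question now has an unambiguous answer:
print(record["date.issued"])   # the publication date, not the creation date
```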

Another consideration for the entry of date information is the exact format, or sequence, in which it is entered into a record. For example, today's date could be written as any of:
February 23, 2014
February 23, '14
Feb 23, 2014
23 February, 2014
2/23/14
2/23/2014
23/2/14
23/2/2014
2-23-14
2.23.14
And that just scratches the surface of the possible variations, not to mention factoring in foreign languages. So which format does a metadata creator use to ensure that the date is adequately understood by potential users? Accounting for regional differences in date presentation, and the fact that computers can't even interpret some of the variations, there is no perfect answer. The W3CDTF encoding scheme is one possible solution, prescribing the YYYY-MM-DD format for single dates. Whichever method the metadata creator settles upon, however, the importance of consistency in date entry cannot be overstated.
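The variant strings above can be funneled into the W3CDTF form mechanically. This is a sketch, not a complete solution: the format list is illustrative rather than exhaustive, and it assumes US-style month-first ordering for the slashed and dotted variants.

```python
from datetime import datetime

# Candidate patterns for the variants listed above (order matters: two-digit
# years are tried before four-digit years so "14" parses as 2014).
FORMATS = [
    "%B %d, %Y",   # February 23, 2014
    "%b %d, %Y",   # Feb 23, 2014
    "%d %B, %Y",   # 23 February, 2014
    "%m/%d/%y",    # 2/23/14
    "%m/%d/%Y",    # 2/23/2014
    "%m-%d-%y",    # 2-23-14
    "%m.%d.%y",    # 2.23.14
]

def to_w3cdtf(text):
    """Normalize a date string to the W3CDTF YYYY-MM-DD form."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {text!r}")

print(to_w3cdtf("February 23, 2014"))  # 2014-02-23
print(to_w3cdtf("2/23/14"))            # 2014-02-23
```

Note that an ambiguous string like 23/2/14 would be rejected rather than silently misread, which is arguably the right behavior: consistency at entry time beats cleverness at parse time.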

Thursday, February 20, 2014

What Dublin Core element would YOU like to work with?

Our digital imaging project has been assigned, and we've got Dublin Core elements spinning around in our heads. But what to choose, what to choose?

Like many of my LS 566 classmates, I'm currently trying to figure out how to make some sense out of ranking the 15 Dublin Core elements in order of preference for our big semester project. A whole Dublin Core element all to myself; should be easy, right? Pick the easy one, you dolt! Well, some of them are theoretically simpler than others, but they all have their wrinkles and quirks, and I don't foresee any one path carrying significantly less work than any other...not that that is a great way of going about choosing to begin with.

So, what do we have to work with here...I know I left a list of Dublin Core elements lying around here somewhere.

Voila!:
-Title
-Creator
-Subject
-Description
-Publisher
-Contributor
-Date
-Type
-Format
-Identifier
-Source
-Language
-Relation
-Coverage
-Rights

If I'm going to be spending a lot of time on this project, and I have no reason to suspect I won't, then I might as well get my money's worth and pick something I'm actually interested in working with. Subject is an obvious choice, right off the bat, as I have enjoyed subject description in the past, and am always fascinated with trying to figure out what an item is about. Perhaps I can bring some elements of controlled vocabulary to it as well?

Title is pretty straightforward, but when dealing with items that don't necessarily have an already-defined title, it raises the possibility of complications.

Creator...oooo, there's another one where some possibility of vocabulary control exists.

Lot of tough choices ahead, but I'm sure there will be some extremely valuable experience going along with any element I ultimately get assigned. What would you pick?

Wednesday, February 19, 2014

On the importance of subject description

In my last blog, I touched a little bit on some of the basic differences between keyword and subject searches. For the library field, the notions of controlled vocabulary and subject headings have been under scrutiny and debate for some time, owing to the proliferation of natural language searching and the still-present difficulty in helping users understand how to properly use a catalog subject search. Two articles I recently encountered for my cataloging class break down some of the reasons why subject headings retain an important place in the library catalog, and suggest what part they might still play in the growing digital information world.

The first article, by Arlene Taylor and Tina Gross, entitled “What Have We Got to Lose? The Effect of Controlled Vocabulary on Keyword Searching Results,” points to the important role that subject headings and subject searches play in facilitating retrieval of relevant resources for catalog users. Subject headings not only provide a controlled method for searching, but also function indirectly by showing up in keyword searches. According to Taylor and Gross's study, if the subject field were eliminated from the catalog, or withdrawn from keyword searches, approximately one-third of relevant resources would fail to turn up as hits. That is a significant number of items eluding the user's search. Additionally, since the terms in question are located in the subject field, they are more likely to point to quality resources relevant to the user's search than terms found in, say, the summary field or table of contents (which are sometimes entered into a record). This isn't to discard keyword searching completely, however. Users at times have difficulty understanding how to search for controlled-vocabulary subjects, or may not know exactly which terms to use for a given search. Because keyword searches are, in some ways, easier to understand, users sometimes prefer them. So it is not so much a matter of choosing keyword searches or controlled vocabulary searches; rather, it is a matter of ensuring that we infuse the precision of the latter into the usability of the former.


In “On the Subject of Subjects,” Arlene Taylor further advocates for the importance of utilizing subject headings and controlled vocabularies, as well as extending their use into the electronic domain and the Internet. Taylor points out that despite the pre-eminence of keyword searches among Internet users, keywords continue to be a poor option where precision of search results is desired. Not only does controlled vocabulary counteract this through the notion of specific entry, but it also expands a user's possible search parameters by relating broader and narrower terms for a subject, as well as synonyms, near synonyms, and other valid variations. Controlled, subject-based description is not without its faults, however. Taylor acknowledges that there is certainly a place for keyword-based techniques, owing to their simplicity, lower cost to create, ease of (automated) maintenance, and ability to stay current. What she advocates for the digital environment is a system in which less important, ephemeral resources are indexed automatically using keyword technologies, while those intended for long-term use fall under controlled vocabulary systems.

It is interesting to note that in the nearly 20 years since the latter article's publication, subject-based vocabulary control does not seem to have taken a strong foothold in the online environment, and likely for very obvious reasons - cost, speed, currency, and the ability to automate. While resources such as WorldCat and the Digital Public Library of America, among others, show what online, linked resources can accomplish with controlled vocabularies, the world of digital information in general exhibits a pretty low degree of accurate description, much to the chagrin of searchers and their millions upon millions of hits. Is there a better way? Now that Pandora's box has been ripped open, it will be all but impossible to close it again.

-Gross, Tina, and Arlene G. Taylor. 2005. "What Have We Got to Lose? The Effect of Controlled Vocabulary on Keyword Searching Results." College & Research Libraries 66, no. 3: 212-230. Social Sciences Citation Index, EBSCOhost (accessed February 18, 2014).

-Taylor, Arlene G. 1995. "On the Subject of Subjects." Journal of Academic Librarianship 21: 484-491. Education Full Text (H.W. Wilson), EBSCOhost (accessed February 18, 2014).

Sunday, February 16, 2014

Keyword versus Subject Searching

One of the major differences between a controlled metadata system like a library catalog and the search engines we use every day on the internet is the way in which subject metadata is created, indexed, and searched. A library catalog, for instance, runs on the idea of a controlled vocabulary, whereby items are allocated pre-defined subject headings by human catalogers (e.g. History--Byzantine Empire). This gives the item a fairly detailed and accurate description, and facilitates the reliable search and retrieval of information resources that library users are familiar with. Search engines, on the other hand, largely rely on automated indexing, and utilize a keyword, or natural language, style of search (e.g. "What year did Jesse Owens win his gold medals?"). The difference between the two systems is fairly well reflected in the results of a typical internet search - millions of hits, and the retrieval of an extraordinary number of websites that have absolutely no relevance to the original search...oh, and don't forget porn sites. This blog entry isn't here to make a claim in favor of one or the other, for that would be a gross oversimplification. Both controlled-vocabulary, subject-based searches and keyword, natural-language searches have their role in the metadata world, and their relative strengths and weaknesses.

If the differences between keyword and subject searches are new to you, the George Mason University Libraries provide a fairly helpful guide that notes the differences between the two search methods and the instances in which each should be used. While the information pertains mostly to the use of keyword and subject searches in a library catalog, it can be equally applied to online resources, discovery services, and search engines. There is also a brief but helpful section on how to use Boolean operators (AND, OR, NOT, etc.) to get the most out of complex searches. Searching may seem to be a very basic skill, but there are actually quite a few tricks that are useful for those looking to spend more time reading relevant articles and pages, and less time browsing pages upon pages of irrelevant search results. With more and more information resources moving into the digital environment, the ability to quickly and efficiently search for specific items will only become more valuable.
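The Boolean operators mentioned above are just set operations on the records each term matches. A toy illustration, using a three-record "catalog" invented for the purpose (real systems index far more fields, but the logic is the same):

```python
# Invented mini-catalog: record ID -> searchable text.
catalog = {
    1: "history of the byzantine empire",
    2: "byzantine art and architecture",
    3: "a history of modern art",
}

def matching(term):
    """Return the set of record IDs whose text contains the term."""
    return {rid for rid, text in catalog.items() if term in text}

# byzantine AND history -> only records containing both terms
print(matching("byzantine") & matching("history"))          # {1}
# byzantine OR history -> records containing either term
print(sorted(matching("byzantine") | matching("history")))  # [1, 2, 3]
# history NOT byzantine -> set difference
print(matching("history") - matching("byzantine"))          # {3}
```

AND narrows, OR widens, NOT excludes; knowing which one you need is most of the trick.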

Friday, February 14, 2014

Digital Public Library

A classmate of mine recently posted a blog entry discussing the Digital Public Library of America, or DPLA. Though I'd seen the DPLA referenced here and there in articles read for other classes, I never really took the time to check it out until now. Having now rectified that oversight, I can say that the DPLA looks to be a fantastic resource, combining some of the best elements one would expect to see in an online library: picture galleries, archival collections, and, of course, a vast number of digitized books from some of the more noteworthy institutions across the United States. Additionally, the DPLA boasts features that give the collection a distinctly modern and digital flavor: apps, games, virtual bookshelves, discovery services, and yes, even a way to make historical Lolcats.

One area of particular interest in exploring the DPLA is the way that item records and metadata are utilized. While books, electronic documents, archival resources, and visual media are normally described using different standards and protocols, the DPLA brings all its resources under a unified internal metadata schema. Unlike some other schemas, however, DPLA's also appears to integrate a number of controlled vocabularies and thesauri, which means the system retains some of the precision in search and retrieval that one would expect to find in other vocabulary-controlled systems, like a library catalog. Though undoubtedly far from perfect, I think the DPLA represents some interesting possibilities for library-type collections in an online environment, bringing together the structured, organized, and controlled features of library collections with the vast amount of information and interactivity available in the digital world.

Wednesday, February 12, 2014

The Ins and Outs of File Naming

While metadata can often come in forms that are increasingly technical and complicated, there are also many forms that are simple to use and understand, and encountered by many of us on an almost daily basis. However, even simplicity and accessibility provide no guarantee that metadata will be used properly, as is evidenced by a form of metadata we are almost all familiar with, and commonly misuse...electronic file names. Rather than take full advantage of this gem of an organizational, and retrieval, tool for our computers and devices, we instead clog up document folders with endless batches of "Untitled" documents and incredibly useful file descriptions such as "receipt," "Agenda," or "list." If you, like me, have failed to show adequate appreciation for this easy-to-use metadata tool, then take heart that the State Library of North Carolina has come to our rescue with a series of four short videos on File Naming Guidelines.

The SLNC covers a variety of basic topics related to file naming, including why it is important, how to alter existing file names, and some best practices to follow, as well as some to avoid. Though some of the information is fairly intuitive and already familiar to users, the videos do provide suggestions and warnings that otherwise may not have been considered. For example, failing to create unique file names for automatically-named files (e.g. digital photos uploaded from a camera) could lead to important files or documents being overwritten and lost, though most modern operating systems have some measure of protection against such an eventuality. Also of interest are the suggestions for characters to avoid in file names, including most special characters (e.g. !, ?, /, $, and the like, due to their use in programming languages), spaces (the underscore is an acceptable alternative), and capital letters (software often makes no distinction between cases). Among the best practices I found most helpful was the recommendation to include a consistently formatted date in file names, which can provide valuable context when searching through old files.
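Those guidelines are simple enough to automate. Below is a sketch that applies my reading of the advice above - strip risky special characters, replace spaces with underscores, lowercase everything, and prefix a consistently formatted date. The rules are my own interpretation, not an official SLNC implementation.

```python
import re
from datetime import date

def safe_filename(name, when=None):
    """Normalize a file name per the (interpreted) SLNC-style guidelines."""
    when = when or date.today()
    name = name.lower().replace(" ", "_")       # no spaces, no capitals
    name = re.sub(r"[^a-z0-9._-]", "", name)    # drop !, ?, /, $, etc.
    return f"{when.isoformat()}_{name}"         # consistent date prefix

print(safe_filename("Board Meeting Agenda!.docx", date(2014, 2, 12)))
# 2014-02-12_board_meeting_agenda.docx
```

The date prefix also means an alphabetical sort of the folder doubles as a chronological one, which is half the point of consistent naming.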

Though not discussed in the videos, I feel that the guidelines provided would also be particularly useful to organizations concerned with long-term preservation, or storage, of electronic files. Naming practices have a considerable impact on the ability of an archivist or records custodian to make sense of the original use, function, organization, and order of electronic documents. Absent a logical and consistent naming scheme, attempts to organize the documents into a meaningful collection can be severely compromised.



Sunday, February 9, 2014

Metadata: Good, Bad, or all in how you use it?

The subject of metadata has definitely been making its rounds in the news of late, often in a more negative than positive light. Between the collection of personal metadata by governmental agencies, and the theft of metadata that compromised millions of Target customers' credit card information, we are all more aware of how vital metadata can be to our privacy and security, and the potentially malicious ways in which it can be gathered and utilized. But are these instances enough to cause a shift in the way our metadata is treated, and overshadow its great number of uses? Two of my LS 566 classmates have posted recent blog entries that well represent the extremes of the Good/Bad scale along which metadata application seems to slide.

Heather Castle's blog post, Metadata and the Olympics, provides links and discussion about metadata collection occurring during the 2014 Sochi Winter Games. Among the less-than-surprising revelations is the disclosure that the Russian government is collecting significant amounts of metadata concerning users of communication and information networks at the Olympics. Such data includes personally identifiable information, and even payment information for products purchased and services rendered. Though ostensibly utilized to ensure the security of the Olympic Games, a potentially high-profile target for malicious activity, this disclosure still represents the less savory uses of metadata that we are seeing more and more regularly.

On the other end of the spectrum, a recent post by Danielle Tinkler discusses the use of metadata in tracking flu activity across the globe. By illustrating information such as the severity (quantity) and dispersion of flu cases, both nationally and internationally, this use of metadata has tremendous value not only in keeping the public aware of the potential for infection, but also in assisting health officials in deciding how best to utilize the resources at their disposal.

As with any tool, the threat, or value, of metadata has far more to do with the motivations of the people collecting and presenting it, than it does the data itself. If we seek to curtail the collection of metadata that potentially poses a threat to us, do we run the risk of equally preventing the use of metadata that provides value? Can we have the latter without exposing ourselves to the former?

Friday, February 7, 2014

Metadata Concepts: Dublin Core


One of the important concepts mentioned frequently in our current text reading is Dublin Core. Dublin Core (DC) is a metadata initiative and element set resulting from a 1995 metadata workshop in Dublin, Ohio, conducted by OCLC and NCSA. It has since been adopted as an ANSI/NISO, ISO, and IETF standard. DC was originally conceived as an effort to extend a standard metadata set to the description of electronic documents, as standards of the time skewed heavily toward the description of physical items. Rather than a replacement for existing standards, Dublin Core was designed as a complementary element set, capable of “filling in the gaps” for document and object types that other metadata standards failed to adequately cover. Compatible with a variety of markup languages, including HTML and XML, DC produces simple, structured records which can be used independently or in conjunction with other metadata standards (e.g. MARC). The simplicity of the element set, coupled with its ability to be modified, extended, and used interoperably with other standards, has led to its widespread acceptance and use in the metadata community.

The base Dublin Core set consists of fifteen elements:
-Title
-Creator
-Subject
-Description
-Publisher
-Contributor
-Date
-Type
-Format
-Identifier
-Source
-Language
-Relation
-Coverage
-Rights

These core elements are all optional, and may be repeated by the metadata creator. The focus of the elements trends toward object search/retrieval and resource discovery, but the extensible nature of Dublin Core enables the set to be adapted and modified to the needs of specific domains and communities. Sets of extra elements, known as qualifiers, can be used in conjunction with the base elements to extend the scope of the set in a standardized way. Examples include the Canberra Qualifiers (Language, Scheme, and Type), which expand upon the functionality of the base elements, but cannot be used independently.
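A minimal sketch of what "optional and repeatable" means in practice: modeling a record as a list of (element, value) pairs lets any element be omitted or repeated. The record itself is invented for illustration.

```python
# Hypothetical simple Dublin Core record as (element, value) pairs.
record = [
    ("title", "Sunset over Mobile Bay"),
    ("creator", "Smith, Jane"),          # invented photographer
    ("subject", "Sunsets"),
    ("subject", "Mobile Bay (Ala.)"),    # repeated element: allowed
    ("date", "2014-02-07"),
    ("format", "image/jpeg"),
]

def values(record, element):
    """All values recorded for a given element (possibly none)."""
    return [v for e, v in record if e == element]

print(values(record, "subject"))    # ['Sunsets', 'Mobile Bay (Ala.)']
print(values(record, "publisher"))  # [] -- every element is optional
```

A dictionary keyed by element name would silently forbid repetition, which is why the pair-list shape is the more faithful model of the standard.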

Dublin Core plays a significant role in the world of metadata standards, especially as applied to electronic documents and media. Its simplicity and adaptability allow it to be tailored to the individual needs of specific information communities, as is illustrated by the set's mapping to schemas such as MODS and MADS, among others. The set continues to undergo development and revision under the stewardship of the Dublin Core Metadata Initiative (DCMI).

Monday, February 3, 2014

Misusing metadata to get the most out of your classical music

Metadata isn't something sacred, although I sometimes can't shake the feeling that using it incorrectly is violating some kind of unwritten cosmic law. It's a tool, and like most tools, you want to use it in the way that it was intended, or your results may skew from the expected. Musical metadata is no exception. Song titles, album titles, artists, composers, and genres are all informational tools that, when used properly, should organize your collection, collocate (or relate together) similar items, and make searching for, and locating, specific items an efficient and painless process.

But sometimes, your music software or device just doesn't work the way you want it to, and the only real solution is changing the metadata from something that's technically correct to something that works a little better in practice. This blog entry will touch on some of the reasons I've had to "fix" my metadata in ways that might not quite be right, but have provided me with the functionality I desired. Be warned though, brave souls: the Naxos Blog discourages you from going too far in disturbing your musical metadata.


1. Problems with device or software functionality
Sometimes, you need to break your metadata because your device or software just won't work when it's entered correctly. Having recently purchased a Sansa Clip Zip mp3 player in my effort to throw off the Apple yoke (OK, I still patronize iTunes), I was unpleasantly surprised to learn that my new gem of a music player doesn't recognize the "Disc #" metadata field for a multi-disc set. By ignoring this tag, my player could not understand that the six track 1s in the set actually belonged to six different discs, and instead treated them all as tracks from the same disc, completely compromising the player's ability to play tracks in the correct order. While perhaps less of a problem for rock or pop albums, for a set of symphonies this was devastating.

To fix this, I was left with two alternatives: break the multi-disc set up so that each disc was treated as a separate album (e.g. Beethoven: The Symphonies [Disc 1], Beethoven: The Symphonies [Disc 2], etc.), or alter the track numbering scheme. The former, while perhaps doing less to compromise the integrity of the metadata, would have had severe implications for my ability to quickly search or sort through albums (imagine adding an extra 30-40 album titles to your list), so I chose the latter. Employing a numbering scheme by which all tracks in the set were consecutively, and uniquely, numbered (e.g. disc 6 holding tracks #36-43 rather than #1-8), I was able to restore the compilation's proper ordering on both my iTunes-based devices and my Clip Zip.
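The renumbering scheme is mechanical enough to sketch in code: given how many tracks sit on each disc, assign every track a unique, consecutive number so players that ignore the "Disc #" field still sort correctly. The disc layout below is invented for illustration, not the actual Beethoven set.

```python
def renumber(tracks_per_disc):
    """Map (disc, original track #) -> a unique consecutive track number."""
    mapping = {}
    next_track = 1
    for disc, count in enumerate(tracks_per_disc, start=1):
        for original in range(1, count + 1):
            mapping[(disc, original)] = next_track
            next_track += 1
    return mapping

discs = [6, 5, 6, 5, 4, 8]      # hypothetical 6-disc set
m = renumber(discs)
print(m[(1, 1)])   # 1  -- disc 1, track 1 keeps its number
print(m[(6, 1)])   # 27 -- disc 6 starts after the 26 earlier tracks
```

Applying the mapping in a tag editor is then just a matter of overwriting each file's track number, leaving every other field intact.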

2. Customizing naming conventions
Consistency of metadata is vital to its effective use, not only in enabling informational items to be grouped and sorted, but also in enabling them to be searched for. When dealing with the description of classical works, naming conventions for an item can vary wildly. Consider the final movement of Brahms' Third Symphony, titled, as most movements are, for its tempo: "Allegro." The number of classical movements titled "Allegro" surely runs into the thousands, if not more, making proper identification and retrieval of this particular piece extremely problematic if that is the only information in the title. But what information should be included in the title to ease the process? Potential options include:

Symphony No. 3-Allegro
Third Symphony-Allegro
Brahms' Third Symphony-Allegro
Brahms: Third Symphony-Allegro
Brahms: Symphony No. 3 in F major-Allegro
Brahms: Symphony No. 3 in F major, Op.90 - 4. Allegro

All are technically correct, and vary from relatively simple to rather complex. For my purposes, I used the last of the naming conventions listed, as I trend toward detailed descriptions. The format does have some drawbacks, however. The composer name at the beginning of the title helps with sorting, but technically should appear only in the "Composer" field. Also of concern is the number of characters used, which could exceed the display limit on certain devices and prevent the full title from being seen. Additionally, ensuring that all pieces in the collection conform to this format is an extremely time-consuming process, and with so much information being entered, attention to detail is vital to ensure each title is entered correctly.

This may be less a misuse of metadata than simply trying to get the most out of it, but custom naming conventions often require a significant amount of metadata to be altered. Whatever you decide to use, pick a format early in the process, and stick to it.
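One way to enforce a convention like the last option above is to build titles programmatically instead of typing them by hand. The helper below is hypothetical, and the breakdown into composer/work/key/opus/movement/tempo is just one possible decomposition of the fields.

```python
def movement_title(composer, work, key, opus, number, tempo):
    """Assemble a movement title in the chosen detailed convention."""
    return f"{composer}: {work} in {key}, Op.{opus} - {number}. {tempo}"

# The Brahms example from the text:
title = movement_title("Brahms", "Symphony No. 3", "F major", 90, 4, "Allegro")
print(title)  # Brahms: Symphony No. 3 in F major, Op.90 - 4. Allegro
```

Generating titles from structured fields sidesteps the attention-to-detail problem: a typo can only occur once, in the data, rather than once per track.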

3. Using incorrect fields to ease searching/sorting
One trick I've seen mentioned on many classical music blogs and forums is the deliberate use of metadata in improper fields to ease the searching/sorting process. For much modern music, an album will have a single artist but possibly many songwriters or composers, making "Artist" a preferred sorting/search field. For classical music, this is reversed: "Composer" is relatively singular for an album, while there are many potential artists. This can leave the classical listener with a relatively more difficult search when using software and devices that place greater emphasis on the "Artist" field, especially if the collection contains both classical and modern music. To fix this problem, some classical collectors simply swap the "Composer" and "Artist" metadata entries, greatly easing the browsing process.
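The field-swap trick amounts to a one-line exchange per track. A sketch, with invented track data; restricting the swap to the Classical genre keeps a mixed collection's modern tracks untouched.

```python
def swap_for_classical(track):
    """Exchange Artist and Composer so Artist-centric devices sort by composer."""
    if track.get("genre") == "Classical":
        track["artist"], track["composer"] = track["composer"], track["artist"]
    return track

t = {"genre": "Classical", "artist": "Berlin Philharmonic", "composer": "Brahms"}
print(swap_for_classical(t)["artist"])  # Brahms
```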

These are a few of the "wrong" uses of metadata I've come across that can actually help the functionality of your devices or collections. While there are definitely some drawbacks to employing them, whether it be the time involved, or the difficulties caused by updates to software or devices, there are also some tangible benefits. I should also point out that if you upload your metadata to be used by other people, you should strongly consider NOT entering information incorrectly in your collection, so as to preserve the integrity of the metadata going out across the web. I would be greatly interested to hear of any fixes that any of you may have employed in your own music collections, or perhaps in other uses of metadata.

Sunday, February 2, 2014

Dvořák or Dvorak? Does it really matter?

One of the more significant issues I've run into in bringing some order to my classical music collection is the lack of consistency in metadata entry. Whether it be the composer, the artist, or even the name of the piece, standards for how metadata is entered seem to vary wildly from CD to CD, wreaking a fair amount of havoc on my ability to organize, sort, and search my collection. But this problem goes beyond the description of items in music software, extending into the description of information items across the internet and the whole digital environment as well.

To apply the title question as an example, let's look at the Czech composer Antonín Dvořák. You will probably notice the name is not incredibly friendly to American keyboards, with accents over the "i" in Antonín and the "a" in Dvořák, as well as the háček, or hook, over the "r" of the latter. It only makes sense, then, despite altering the integrity of the name, that we will often prefer to enter it as the anglicized Antonin Dvorak instead. Regardless of which form of the name is more convenient, however, if you're striving for consistency in data entry, what form should you use in the "Composer" field of a program like iTunes, MediaMonkey, or Music Collector? A browse through the information my collection of CDs actually produced reveals an even bigger problem than expected:

Antonín Dvořák
Antonin Dvorak
Dvořák
Dvorak
Dvořák, Antonín 
Dvorak, Antonin
Dvořák, Antonín (1841-1904)

One can imagine how much more difficult such inconsistency makes it to quickly and efficiently find the particular works I'm looking for when browsing by Composer. Now multiply this by the 20-30 classical composers included in the collection, plus the sometimes-odd use of the Composer field in modern music, and you've got a true mess on your hands, and a tool that is almost unusable. And this doesn't even consider the similar problems one might encounter in the Title or Artist fields.

The fix for such a problem, in the library world, is known as authority control. Authority control is the reason you can utilize a library catalog and efficiently find results on a particular topic through subject and author search terms that have been pre-defined by organizations such as the Library of Congress in its LC Subject Headings and LC Name Authority File. For this particular case, the LC Name Authority File uses the official entry Dvořák, Antonín, 1841-1904, which ensures that any works created by Dvořák are uniformly described and easily searchable.
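A poor-man's version of authority control for the Composer field can be sketched as a lookup table mapping the variant strings above onto the single authority form. The table is hand-built, which is precisely the maintenance burden real authority work entails.

```python
# Authority form from the LC Name Authority File, per the text.
AUTHORITY = "Dvořák, Antonín, 1841-1904"

# The variant strings observed in the collection (from the list above).
VARIANTS = {
    "Antonín Dvořák", "Antonin Dvorak", "Dvořák", "Dvorak",
    "Dvořák, Antonín", "Dvorak, Antonin", "Dvořák, Antonín (1841-1904)",
}

def normalize_composer(name):
    """Collapse known variants onto the authority form; pass others through."""
    return AUTHORITY if name.strip() in VARIANTS else name

print(normalize_composer("Dvorak"))           # Dvořák, Antonín, 1841-1904
print(normalize_composer("Johannes Brahms"))  # unchanged: not in the table
```

Running something like this over an exported track list would collapse the seven Dvořáks in my Composer browser down to one, which is the whole promise of authority control in miniature.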

Authority control is an easy fix in theory, but it requires the establishment of a person or group to create and maintain the authority files, and the willingness of the affected metadata creators to accept the use of authority records. While it is possible that companies in a like field (e.g. music companies) could agree to the common use of authority records, gaining the allegiance of the millions of us common folk who create metadata on a daily basis is a much taller order. I don't see a simple solution to this problem, especially considering the multi-national character of the internet and its associated metadata. It may just be that the online environment is too far gone to hope for any meaningful return to organization and control, if there ever was any. What do you think? Do you have any authority control horror stories?