Yelling

Why I Care About Metadata, part 2

By Elizabeth McCraw

A little while ago, I posted a Yelling column called Why I Care About Metadata, in which I defined metadata and demonstrated it with examples of code from this website. Now I am going to show you in greater depth why metadata is important by demonstrating how inaccurate metadata can prevent you from finding resources. Go to Google Scholar. Open “Advanced Scholar Search.” In the exact phrase box, type "world wide web" and limit the dates to 1800-1900. Since the web didn't exist until well after that, we shouldn't get any hits, right? When I ran this search on May 5, 2012, I got 79 results, including an article called "Web Services and Web Services Security" that Google thinks was published in 1839. In actuality, this paper was published in 2004; 1839 is the author's telephone extension.

A human would not make this gaffe; anyone reading the document would instantly know that the publication date is 2004. Computers, however, need to have it spelled out explicitly. In this case, the problem lies with OCR (optical character recognition) and an assumption on Google's part that a four-digit number could be a year. As Google’s crawlers scan the document, they find the text string “1839” and decide that this is probably the publication date of the article. (For a fuller discussion of problems in Google's metadata, I recommend Geoff Nunberg's "Google Books: A Metadata Train Wreck".)
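To make the failure concrete, here is a minimal Python sketch of the kind of guess described above. It is an illustration only, not Google's actual code: the function name guess_publication_year, the regular expression, and the sample text are all hypothetical.

```python
import re


def guess_publication_year(text):
    """Naively treat the first plausible four-digit number as the year.

    Hypothetical heuristic for illustration; not Google's actual logic.
    """
    match = re.search(r"\b(1[89]\d\d|20\d\d)\b", text)
    return int(match.group(1)) if match else None


# Made-up stand-in for the text OCR might pull from a paper's first page.
sample = (
    "Web Services and Web Services Security\n"
    "Contact the author at ext. 1839\n"
    "Published 2004\n"
)

print(guess_publication_year(sample))  # prints 1839
```

The telephone extension appears before the real date, so the guess lands on 1839. A person reading the same three lines would never make that choice.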

Is it a huge deal? After all, most people using Google aren't what we call "power users." My example used more advanced search techniques, but most people would simply input a few keywords in the search box and hit "enter." When I ran a search for "cyber security" in Google on May 5, 2012, I pulled up over 98 million hits. On the first page, I got recent news, including articles published in the last day, and the White House page on cyber security. For most users, this would be more than adequate. But what about power users? Let's say you're writing a paper on the history of cyber security and you need scholarly articles from 2004. "Web Services and Web Services Security" would be ideal for you, but Google isn't going to find it given those parameters because it thinks it was written in 1839.
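The effect on a date-limited search can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how Google Scholar works: the Record class, the indexed_year field, and the filter_by_year function are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    indexed_year: int  # the year stored in the index, right or wrong


def filter_by_year(records, start, end):
    """Keep only records whose indexed year falls in [start, end]."""
    return [r for r in records if start <= r.indexed_year <= end]


# The paper was published in 2004 but is indexed under 1839.
records = [Record("Web Services and Web Services Security", 1839)]

print(filter_by_year(records, 2004, 2004))        # prints []
print(len(filter_by_year(records, 1800, 1900)))   # prints 1
```

The filter can only compare against the year it was given. Once the wrong year is in the index, the paper disappears from the 2004 search and turns up in the nineteenth century instead.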

This is why, as a web designer, I take the time to encode metadata that sends search engines more specific information about the page. Google doesn't have to use OCR to guess when this page was created. Instead, I have that information encoded right in the document header. Want to see it? Right-click on this page and select "view page source" (or the equivalent in your browser). In the eighth line, you'll see this statement:

<meta property="dc:created" content="2012-05-16" />

In simple terms, this tells search engines, and any other software that reads the page, that it was created on May 16, 2012. I can include other four-digit numbers in the article to my heart’s content, and Google doesn’t have to wonder if they’re the year I created this document. Thanks to the metadata, it already knows.
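For the curious, here is a rough Python sketch of how a program could read that statement instead of guessing. It uses the html.parser module from Python's standard library; the CreatedDateParser class and the sample page are hypothetical, and real crawlers are certainly more elaborate than this.

```python
from html.parser import HTMLParser


class CreatedDateParser(HTMLParser):
    """Collect the content of a <meta property="dc:created"> tag."""

    def __init__(self):
        super().__init__()
        self.created = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("property") == "dc:created":
            self.created = attributes.get("content")


# Made-up page whose body contains other four-digit numbers.
page = """<html><head>
<meta property="dc:created" content="2012-05-16" />
</head><body><p>Call extension 1839 about the 2004 paper.</p></body></html>"""

parser = CreatedDateParser()
parser.feed(page)
print(parser.created)  # prints 2012-05-16
```

The numbers 1839 and 2004 in the body never enter the picture, because the program looks only at the field that was labeled as the creation date.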