Thursday, September 15, 2011

Using MS Word to extract metadata from a report

I have been dealing with metadata here at the center for the past 6 years and have seen it slowly evolve from a painful and tedious process of copy-and-pasting to generate hundreds of files, to what it is today...a more robust, automated system where metadata is mined directly from the raw data files.

There was still one missing piece to the automated system of metadata creation, and after my last cruise to the Necker Ridge I've come up with one solution for it!  The missing piece has always been that wordy part of the metadata...the one that deals with the attribute accuracy reports, the abstracts and the process descriptions.  These things can't generally be mined from the original data.

After this particular cruise I was reading and reformatting the cruise report and it dawned on me...there is all the metadata I've been after!  (Why it took me 6 years to finally really read the report is beyond me!)  I just needed a way to tag it and save the tagged portions so I could do as I please with them.

I also needed to do this "tagging" in such a way that NO ONE was intimidated or overwhelmed by it!  So, that meant just doing it in Word, since that seems to be how most of the reports are written up.

This tagging is actually user-defined XML: when the document is saved as XML, the output follows exactly the format that the XSD specifies.  Ahhh, XSD, so someone with some XML experience will have to set the whole thing up...but, once it's done...it's done forever!

  1. First set up an XSD (an XML schema) to define the layout of the document.
  2. Go to Word Options -> Popular and check "Show Developer tab in the Ribbon".
  3. Select the Developer tab and click on Structure.  That will show a side bar (on the right) for the XML structure.
  4. Click the link in the XML Structure window (Templates and Add-Ins).
    1. Click the Add Schema button and find the desired XSD.
    2. Fill in the Schema Settings with any URI (it is just a name, so anything descriptive will do) and an alias (to shorten the URI if desired).
    3. You will then see it in the list of Available XML Schemas.  Check the schema to make sure it is being used.
    4. Also, you will probably want to check "Validate document against attached schemas".
  5. After you click OK you will see a list of elements available to "apply to your current selection"...now you are ready to "tag" any part of your report (thus marking it up with XML)!
  6. When it is all marked up, select "Save As..." and choose Word 2003 XML Document (*.xml).
    1. At this point you can also apply a Transform (XSLT) and the file will be saved already transformed!
    2. Make sure you select "Save data only" if you only want the tagged data in the .xml file.
You will then have the desired marked-up XML document!  All from the familiar Word document!
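
To give a feel for what comes out of the steps above, here is a minimal Python sketch that reads a "data only" file and collects the tagged text by element name.  The element names (cruiseReport, abstract, processDescription), the namespace URI and the sample text are all made up for illustration; your own XSD decides what they really are, and the exact shape of the file Word writes may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a "Save data only" file. The element names,
# namespace URI and text are assumptions, not taken from a real report.
SAMPLE_XML = """<cruiseReport xmlns="urn:example:cruise-report">
  <abstract>Placeholder abstract text.</abstract>
  <processDescription>Placeholder step one.</processDescription>
  <processDescription>Placeholder step two.</processDescription>
</cruiseReport>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_tagged_text(xml_text):
    """Return {element name: [text, ...]} for each child of the root element."""
    root = ET.fromstring(xml_text)
    tagged = {}
    for element in root:
        # itertext() also picks up text inside nested elements;
        # split/join collapses line breaks and repeated spaces.
        text = " ".join("".join(element.itertext()).split())
        tagged.setdefault(local_name(element.tag), []).append(text)
    return tagged


sections = extract_tagged_text(SAMPLE_XML)
print(sections["abstract"][0])
```

Each name maps to a list because a tag such as a process description may well be applied to more than one passage in the report.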
With this we are now able to auto-generate the core metadata items from the raw data, and the textual part of the metadata can now be harvested directly from the cruise report.  One simple command calls the Python script that pulls it all together, and we have XML metadata for each of our hundreds of files, ready to be archived at NGDC!
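
The "pulling it all together" step could look something like the sketch below.  This is not the actual script, just an illustration of the idea: the per-file values mined from the raw data are combined with the text harvested from the report, giving one XML record per data file.  The file names, element names and values are hypothetical placeholders, and a real record would follow whatever metadata standard the archive requires.

```python
import xml.etree.ElementTree as ET


def build_metadata(core_items, report_text):
    """Combine per-file values and shared report text into one XML element."""
    root = ET.Element("metadata")  # hypothetical root element name
    for name, value in core_items.items():
        ET.SubElement(root, name).text = str(value)
    for name, paragraphs in report_text.items():
        for paragraph in paragraphs:
            ET.SubElement(root, name).text = paragraph
    return root


# Text harvested once from the tagged cruise report (placeholder content).
report_text = {"abstract": ["Placeholder abstract from the cruise report."]}

# Core items mined from each raw data file (made-up names and values).
data_files = {
    "line_0001.raw": {"westBound": -167.5, "eastBound": -166.0},
    "line_0002.raw": {"westBound": -166.0, "eastBound": -164.5},
}

# One XML record per data file, sharing the same report text.
records = {
    file_name: ET.tostring(build_metadata(core, report_text), encoding="unicode")
    for file_name, core in data_files.items()
}

for file_name, xml_text in records.items():
    print(file_name, xml_text)
```

The report text is read once and reused for every file, which is what makes the hundreds-of-files case cheap: only the mined values change from record to record.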
