Extracting Metadata (basic)
Posted by Dawn on August 24, 2008
Discussed with Mark Thursday 24th ish before going on holiday.
The basic interface and functionality of the automatic metadata generator are almost complete. As identified in the report on eCat’s use of metadata, there are three potential sources of metadata:
- The content of the learning objects (LOs) and their supporting documentation.
- Persistent collections of data – data that can be reused with each new metadata file.
- System data – data that can be generated from the system architecture and file formats etc.
The prototype has been designed to use collections of data for contributor information (covering both the content, LOM 2 Lifecycle, and the metadata, LOM 3 Meta-metadata), requirement information (LOM 4 Technical) and rights information (LOM 6 Rights). Personal preferences, another collection of data, can be used for LOM 2.1 Version, 2.2 Status, the majority of the LOM 5 Educational section and some aspects of the LOM 4 Technical section.
System data is captured to identify file types and sizes. Ideally, login information from embedded organisational systems should also be used to capture the user’s personal details. This is not implemented at this time, as linking in to university systems is a prolonged and difficult task. A separate flat-file login process has been set up to represent this, enabling the user to write their details once and reuse them many times.
The final source of metadata is that of the LO and any organisational documentation or development notes (referred to as Scripts). This data can be used to identify potential keywords and possible classification of the LO. Classification has not been explored at this stage as the LeedsMet repository (the main test bed) has not identified the classification system it is going to use with LOs. The money is currently on JAC, but we shall have to wait and see.
I have focused on Word docs to start with but should be able to utilise HTML, PowerPoint and possibly PDF by the end of the project. Mark did point out that there is a substantial difference between Word 2003 (my current version) and Word 2007 (the new XML format for Vista), but there are limitations to what we can achieve here. Maybe someone else can hack that for me.
The aim of the extraction process is to generate a set of potential keywords from the documents supplied by the user. I have been running tests on some student essays at the moment, as their topics are easy to distinguish. This potential set is then presented to the user so they can select what they think is most appropriate or add new words. It’s more of a brain stimulator than a definitive answer to the keyword generation problem.
To do this I’ve started with the basic methods used in Information Retrieval (IR) problems. Simple term frequency (TF) scans the document, counting the number of times each word appears. There is usually some pre-processing of the document, such as removal of stop words and stemming. I’ve opted for just the stopping process, as stemming returns many words that don’t convey the true contextual meaning from the perspective of keywords. For example, computing becomes compute.
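As a rough illustration of the stop-word-then-count step, here is a minimal Python sketch. It is not the prototype’s code: the tiny stop-word list, the `tokenize` and `term_frequency` names and the sample sentence are all assumptions made up for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list only (an assumption); a real list is far longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
              "that", "for", "on", "with", "as", "are"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(text, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears. No stemming is applied."""
    return Counter(word for word in tokenize(text) if word not in stop_words)

sample = "Computing is the study of computing systems and the theory of computing."
counts = term_frequency(sample)
top_terms = counts.most_common(3)  # candidate keywords, most frequent first
```

Because no stemming is applied, a word such as computing is kept in the form the author actually wrote, which is the form you would want to offer as a keyword.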
TF can expand into various other methods, one of the most common being term frequency–inverse document frequency (tf-idf). Basically, this is a weighted measure across a document set (not a single document). It ranks highest the terms that occur frequently within a given document but appear in the fewest documents across the set. This can only be used if the user submits several documents to the auto generator, so it has limitations.
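A small Python sketch of one common tf-idf variant (raw count multiplied by the natural log of the number of documents over the document frequency) may help make this concrete. The function names and the three toy documents are assumptions for illustration, and stop-word removal is left out to keep it short.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return one {term: score} dict per document in the set."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc_counts in counts:
        doc_freq.update(doc_counts.keys())

    # Score = term count in this document * log(N / document frequency).
    return [
        {term: tf * math.log(n_docs / doc_freq[term])
         for term, tf in doc_counts.items()}
        for doc_counts in counts
    ]

essays = [
    "metadata describes learning objects",
    "learning objects support teaching",
    "keywords summarise learning content",
]
scores = tf_idf(essays)
```

In this toy set, learning appears in every document and so scores zero, while metadata appears in only one and scores highest for the first document. With a single document every score would be zero under this formula, which is the limitation mentioned above.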
The final method I tested was weighted document structures. This counts terms again but adds greater weight to those that appear in headings and titles. This can be used both on a single document and on a document set.
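The weighting idea can be sketched in Python as follows. This assumes the headings and body text have already been separated out of the document (for example into heading and body pairs), which the sketch does not attempt. The weight of 3 for heading terms is an arbitrary illustrative value, not the one used in the prototype.

```python
import re
from collections import Counter

HEADING_WEIGHT = 3  # assumed value, purely for illustration

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def weighted_term_frequency(sections, heading_weight=HEADING_WEIGHT):
    """Score terms from (heading, body) pairs.

    Each occurrence in a heading adds heading_weight to the term's score;
    each occurrence in body text adds 1.
    """
    scores = Counter()
    for heading, body in sections:
        for word in tokenize(heading):
            scores[word] += heading_weight
        for word in tokenize(body):
            scores[word] += 1
    return scores

document = [
    ("Metadata extraction",
     "Keywords are extracted from the essay text and the essay title."),
]
scores = weighted_term_frequency(document)
```

Passing the sections of several documents in one list gives the same scoring over a document set.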
Generally, I found very little difference across the three methods. The top three to five terms tended to be the same (ordering was often a little different), with the next five to ten words being a mix of useful and not so useful terms. No particular method stood out from this, but they all seemed to be putting relevant words at the top of their lists.
Now I need to consider how to mix these basic techniques with the different types of content the user may submit. A textual learning object may benefit from the weighted term frequency, whereas scripts and university documents may perform better using the cross-document-set tf-idf.
This entry was posted on August 24, 2008 at 3:16 pm and is filed under General, Metadata, Reflections. Tagged: Metadata, prototype.