Over the last month I have been looking at university documents to identify potential keywords within them. The documents have ranged from web pages about departments and courses to module specifications and lecture material. By organising them into four groups (organisation level, department level, course level and module level), the flow of keywords can be visualised. This shows that there are fewer relevant keywords at the higher levels, and that they are more generally representative of the learning material.
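As a rough illustration of the four-level grouping, the sketch below records which levels each keyword appears at and counts the distinct keywords per level. The function names and the dictionary layout are assumptions made for this example, not the tooling actually used.

```python
# Minimal sketch: trace keywords across the four document levels.
# LEVELS is ordered from most general to most specific.
LEVELS = ["organisation", "department", "course", "module"]


def keyword_flow(keywords_by_level):
    """Map each keyword to the list of levels it appears at, top-down."""
    flow = {}
    for level in LEVELS:
        # Lower-case and de-duplicate so "SQL" and "sql" count once per level.
        unique = {kw.strip().lower() for kw in keywords_by_level.get(level, ())}
        for kw in sorted(unique):
            flow.setdefault(kw, []).append(level)
    return flow


def counts_per_level(keywords_by_level):
    """Count the distinct keywords found at each level."""
    return {
        level: len({kw.strip().lower() for kw in keywords_by_level.get(level, ())})
        for level in LEVELS
    }
```

Called with a dictionary such as `{"organisation": [...], "module": [...]}`, `keyword_flow` shows how far down each keyword travels, which is the information a flow diagram would draw on.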
The next stage is to ask the author of each module to identify the keywords they would use. These can then be compared with the words generated by various NLP (Natural Language Processing) parsers, and the parsers' outputs can be compared with each other.
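One simple way to make that comparison is set overlap. The sketch below assumes the author's keywords and each parser's output are available as plain lists of strings, and uses Jaccard similarity as the measure; both the measure and the names (`compare_keywords`, `parser_a`) are illustrative choices, not a settled method.

```python
# Minimal sketch: compare author-chosen keywords with generated ones.

def normalise(keywords):
    """Lower-case and trim keywords, dropping empty entries."""
    return {k.strip().lower() for k in keywords if k.strip()}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def compare_keywords(author, generated):
    """Compare the author's keywords with each parser's keywords.

    `generated` maps a parser name to its list of keywords.
    """
    author_set = normalise(author)
    report = {}
    for name, words in generated.items():
        word_set = normalise(words)
        report[name] = {
            "shared": sorted(author_set & word_set),
            "author_only": sorted(author_set - word_set),
            "generated_only": sorted(word_set - author_set),
            "jaccard": jaccard(author_set, word_set),
        }
    return report
```

Exact string matching is the weak point here: "normalisation" and "normalising" would count as different keywords, so stemming or lemmatising both lists first would probably give a fairer comparison.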
Alongside this I have been looking into the techniques used in automatic summary generation. These deal with both single documents and collections of documents. This work is ongoing and I will post more later, but some of the techniques may prove useful for supporting automatic metadata generation.