A Semantic Indexing Primer
The rapid growth in unstructured data available online and in electronic form has been both a curse and a blessing for the liberal arts. While more material is publicly available than ever before, it's getting harder to find relevant data in the information sea, and current search engines have not been able to keep pace with the explosive growth of online material.
The National Institute for Technology in Liberal Education (NITLE) and Middlebury College have been experimenting with algorithms to help unstructured data organize itself into conceptually useful categories without human intervention. Part of our motivation is to find an alternative to spending prohibitive amounts of time and money on marking up course materials, documents, and online collections with metadata by hand. For many of the most common markup standards in use today, such as SCORM or Dublin Core, it can actually take longer to create markup than it did to create the course materials themselves.
As part of our research, we have created a set of pilot semantic search engines and data visualization tools based on an information retrieval technique known as latent semantic indexing (LSI). Although still in the initial stages of development, these prototypes are already showing promising results and helping researchers in real applications. Latent semantic search engines help make unstructured data searchable in three ways:
- They improve regular keyword searches by noticing word co-occurrence patterns across the entire collection, and then using those patterns to infer semantic relationships between documents. This means that you can query a latent semantic search engine and get back relevant results that don't necessarily contain any of the keywords you entered. For example, you might search a historical database for 'Napoleon' and get back documents that contain the word 'Bonaparte', even if the word 'Napoleon' does not appear in them.
- Latent semantic search engines make it possible to narrow your search by showing the engine relevant documents and asking it to find similar items in its collection. This kind of iterative search enables users to home in on specific topics without having to master an arcane query syntax.
- By finding relationships within unstructured data, semantic search engines can help automate some of the rote work of categorization and indexing that currently falls to archivists and librarians, freeing those professionals to work at a higher level. Rather than working with completely unsorted data, they can use LSI to pre-sort documents automatically into a chosen schema according to criteria they define.
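The first two ideas above can be illustrated with a minimal sketch. This is not the NITLE prototype; it assumes a tiny made-up corpus, raw term counts with no weighting, NumPy's singular value decomposition, and a rank of 2 chosen by hand. The names `docs`, `lsi_scores`, and `more_like` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Toy corpus (invented for illustration). Document 1 never mentions "napoleon".
docs = [
    "napoleon emperor france army",
    "bonaparte emperor france exile",
    "napoleon bonaparte waterloo army",
    "protein gene sequence analysis",
    "gene sequence protein folding",
]

# Build a term-document matrix of raw counts (rows: terms, columns: documents).
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1.0

# Factor the matrix and keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed rank for this toy example
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T  # rows of V_k are document vectors


def cosine_to_docs(vec):
    """Cosine similarity between one reduced vector and every document vector."""
    norms = np.linalg.norm(V_k, axis=1) * np.linalg.norm(vec)
    return (V_k @ vec) / norms


def lsi_scores(query):
    """Fold a keyword query into the reduced space and score every document."""
    q = np.zeros(len(vocab))
    for w in query.split():
        if w in index:
            q[index[w]] += 1.0
    q_k = (q @ U_k) / s_k  # project the query onto the k retained dimensions
    return cosine_to_docs(q_k)


def more_like(doc_id):
    """Query by example: score every document against an existing one."""
    return cosine_to_docs(V_k[doc_id])


scores = lsi_scores("napoleon")
similar = more_like(3)
print(np.round(scores, 2))
print(np.round(similar, 2))
```

In this sketch the 'Bonaparte' document scores highly for the query 'napoleon' because it shares co-occurring terms with documents that do contain the keyword, while the two biology documents score near zero. A real collection would need term weighting, stop-word handling, and a much larger rank.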
So far NITLE has used LSI techniques only to search natural language document collections, but we suspect that the algorithms (which are based on pure matrix algebra) can also be usefully applied to non-linguistic data sets. Some of the many exciting possibilities for extending LSI beyond language include protein and gene sequence analysis, combinatorial chemistry, astrophysics, data visualization, and medical imaging. Many of these applications represent open areas of research where undergraduate students would be able to work in uncharted territory and potentially make a valuable contribution to their chosen field of study. You can read more about the techniques behind LSI and its potential applications in a variety of fields at http://www.knowledgesearch.org/lsi/lsa_intro.htm.