Semantic Engine C++ API

Indexing Classes

By Aaron Coburn

This file describes the Semantic Engine indexing classes, used to create a data structure sufficient for modeling large text collections. Once this model is created, it can be queried and analyzed using graph-theoretic operations.


1. Part of Speech Tagging
     1.1. Invoking the POS Tagger
     1.2. Tokenization
     1.3. Adding POS Tags
     1.4. Extracting Terms
     1.5. The File: tagger.hpp
2. Document Parsing
     2.1. Invoking the Parser
     2.2. Word Filters
     2.3. Stemming and Termlists
     2.4. The File: parsing.hpp
3. Document Indexing
     3.1. The File: indexing.hpp


1. Part of Speech Tagging

The Semantic Engine indexer identifies valid terms from each document by using only certain classes of words. Typically, this includes only nouns, but facilities are in place to identify noun phrases, verbs, adjectives and proper nouns.

The part of speech of words is significant in building an index because it obviates the need for stoplists. Instead, word categories can be excluded: pronouns and prepositions, for instance, carry very little helpful semantic information for an indexer. Semantic richness usually lies with nouns, and especially with noun phrases. So in a given document most words are not indexed at all; this reduces both the size of the index and the 'noise' of irrelevant words.

To do this, the indexer uses a Part-of-speech tagger. The tagger first tokenizes the document, separating words from punctuation. Sequentially, it then classifies each word into one of 44 part of speech categories. These categories are the same as are used in the Penn Treebank Project.

Using data relating to the frequency of certain POS tags occurring with the words in a text corpus, the tagger builds a probabilistic model for the likelihood that a given word is a certain part of speech. The most likely tag is given to the word.

This allows polysemous words, such as 'fly' to be distinguished: one may wish to keep the noun forms but discard the instances of it as a verb.

Unknown words -- words not occurring in the lexicon -- are classified according to word morphology. For instance, words ending with an -ly will likely be adverbs, while words ending with an -s are probably either plural nouns or 3rd person singular verbs.

Once all the words are tagged, terms and term counts are extracted from the tagged text and given to the user for further processing.


1.1. Invoking the POS Tagger

There are two ways to call the tagger, both of which require a lexical file formatted in a particular way. A full lexicon for English is installed as a standard part of the Semantic Engine. If no lexicon is provided to the Tagger constructor, the standard English lexicon is used. To access the default lexicon, the semantic.hpp header file is used.

1. includes+={
#include <semantic/semantic.hpp>
#include <semantic/utility.hpp>
}
This macro is defined in definitions 1 and 3.
This macro is invoked in definition 16.

In short, the tagger is invoked like so:

semantic::tagger MyTagger(lexicon);

The constructor has been designed to be simple. It will accept either a filestream containing the lexicon or the name and path to the file holding the data. If no argument is given, the default installation location is searched.

2. Tagger Constructor={
tagger( std::istream& lexicon_stream )
{
    init();
    load_lexicon( lexicon_stream );
}

tagger( const std::string& lexicon_location="" )
{
    init();
    load_lexicon( lexicon_location );
}
}
This macro is invoked in definition 16.


1.2. Tokenization

Before doing any POS tagging, a rather messy set of tokenization operations need to be performed on the text. These involve removing punctuation from the word boundaries and modifying single and double quotes into a standardized 'directional' quote (i.e. from "word" to `` word '' and from 'term' to ` term '). Furthermore, any period marking the end of a sentence is separated from the text while periods that are a part of an abbreviation are kept joined to the term. For these operations, abbreviations.hpp is needed for distinguishing among the various uses of periods. The Boost string algorithm and cctype classes are also used for case conversion and case checking.

3. includes+={
#include <semantic/abbreviations.hpp>
#include <boost/algorithm/string.hpp>
#include <cctype>
}
This macro is defined in definitions 1 and 3.
This macro is invoked in definition 16.

4. Tokenizer={
std::vector<std::string> tokenize(const std::string& text)
{
    std::string s(text);

    /* load the standard set of abbreviations */
    Abbreviations abbrs;

    
Remove HTML

    
Tokenize on whitespace

    
Separate final periods

    
Separate words from punctuation

    
Separate other periods

    return final;
}
}
This macro is invoked in definition 16.

Generally, any text that is not part of an ordinary sentence should not be passed in to the tagger (HTML, images, other markup, headers, page breaks, etc). But since HTML is so pervasive, the tokenizer first removes all HTML and HTML-like tags: anything with angle brackets, which (for me) is the most common form of markup to slip into the tagger. Any markup will cause the POS tagger to misbehave.

5. Remove HTML={
// Remove any HTML-like tags that sneaked in
std::string::size_type htmlPos = s.find_first_of("<",0);
std::string::size_type lastHtmlPos = s.find_first_of(">",htmlPos);
while( htmlPos != std::string::npos && lastHtmlPos != std::string::npos ){
    s.replace(htmlPos,lastHtmlPos-htmlPos+1," ");
    htmlPos = s.find_first_of("<",htmlPos+1);
    lastHtmlPos = s.find_first_of(">",htmlPos);
}
}
This macro is invoked in definition 4.

After removing any markup, the tagger does a first-round at tokenization, using only whitespace as a guide. The text is placed into an STL vector<string>. If, for whatever reason, the vector is empty, just short circuit the tagger.

6. Tokenize on whitespace={
// Simple tokenization on whitespace
std::vector<std::string> coll;
std::string delim = "\x20\x09\x0a\x0c\x0d\xa0\x85";
std::string::size_type lastPos = s.find_first_not_of(delim,0);
std::string::size_type startPos = s.find_first_of(delim,lastPos);

while( startPos != std::string::npos || lastPos != std::string::npos ){
    coll.push_back( s.substr(lastPos,startPos-lastPos) );
    lastPos = s.find_first_not_of(delim,startPos);
    startPos = s.find_first_of(delim,lastPos);
}

// short circuit...
if( coll.empty() )
    return coll;
}
This macro is invoked in definition 4.

If the final token ends in a period or sequence of periods, separate them from the word. (This is the easiest case to deal with, so it is done separately.)

7. Separate final periods={
// If the final token has a period, remove it
std::string elem = coll.back();
unsigned int n = 1;
std::string lchar = elem.substr(elem.length()-n,1);
if( lchar == "." ){
    while(elem.length() >= n && elem.substr(elem.length()-n,1) == "."){
        n++;
    }

    coll[coll.size()-1] = elem.substr(0,elem.size()-(n-1));
    std::string final_period;
    for( unsigned int j=1;j<n;j++){
        final_period += ".";
    }
    coll.push_back(final_period);
}
}
This macro is invoked in definition 4.

Now that a vector of word tokens has been created, loop through each item, test whether there is any punctuation present, and if so, separate the word from the punctuation using clean_punctuation(). After cleaning, the token may have whitespace introduced as a way of identifying the new breakpoint. These points are used to divide the token further. A new vector<string> called tokens is created.

8. Separate words from punctuation={
// Do the real tokenization now by separating off the punctuation
std::vector<std::string> tokens;
std::vector<std::string>::iterator pos;
for( pos = coll.begin(); pos != coll.end(); ++pos ){
    std::string tok = *pos;

    // only clean punctuation if there's something to clean
    if( tok.find_first_not_of(letters+numbers,0) != std::string::npos ){

        tok = clean_punctuation( tok );

        std::string::size_type end = tok.find_first_of( " ", 0 );
        if( end != std::string::npos ){

            // There are multiple tokens here
            tokens.push_back( tok.substr(0,end) );

            while( end != std::string::npos ){
                std::string::size_type beg = tok.find_first_not_of( " ", end );
                end = tok.find_first_of( " ", beg );
                tokens.push_back( tok.substr(beg,end-beg) );
            }

        } else {
            // There's just one token here
            tokens.push_back( tok );
        }
    } else {
        tokens.push_back( tok );
    }
}
}
This macro is invoked in definition 4.

Once all the punctuation is separated from the terms, the periods still remain to be processed. Here, a heuristic is used to identify whether the period marks the end of a sentence or the end of an abbreviation:

1. is the next word upper case or does it begin with a non-letter character?

2. is the token not listed among the list of common abbreviations in the abbreviations class?

If either of these are true, the period is treated as the end of a sentence and separated from the word.

9. Separate other periods={
/*    Now deal with all the periods, most of which should be
 *    separated from the word (i.e. when it indicates the end
 *    of a sentence), but some should be left on the word
 *    (i.e. for abbreviations) */
std::vector<std::string> final;
std::vector<std::string>::size_type size = tokens.size();
for( unsigned i=0; i<size; ++i ){
    std::string token = tokens[i];
    std::string::size_type len = token.length();
    if (    len > 1 &&
            token.substr(len-1,1) == "." &&
            i + 1 < size &&
            ( is_upper( tokens[i+1] ) || ! is_alpha( tokens[i+1] ) ) &&
            !abbrs.is_abbreviation( token ) ) {

        // Separate word and period
        // Add both to tokens
        final.push_back( token.substr(0,len-1) );
        final.push_back( "." );
    } else {
        final.push_back( token );
    }
}
}
This macro is invoked in definition 4.


1.3. Adding POS Tags

Given a string of text, the tagger tokenizes the text according to the procedure above. The tagger then builds a finite state machine as it processes each item in the list of tokens. In other words, a hidden markov model of likelihoods for different bigram tag combinations is employed: using the tag from the previous word (initially set at 'pp' or end of sentence) and the likelihood that the given word corresponds to each of the available parts of speech, the POS tag is deterministically identified.

The add_tags() method will process each token, first by cleaning the token with the clean_word() method. This mostly amounts to finding a version of the term in the word lexicon, possibly in a lower-case version, for instance. Numbers and symbols are reduced to simplified forms that can be understood by the lexicon, as are unknown words.

10. Add POS Tags={
std::string add_tags( const std::string& text )
{
    std::vector<std::string> toks =  tokenize( text );
    std::string tagged;

    for( unsigned i = 0; i < toks.size(); ++i ){
        std::string word = clean_word( toks[i] );
        std::string tag = assign_tag( word );


        tagged.append( toks[i] + "/" + tag + " " );
    }
    return tagged;
}
}
This macro is invoked in definition 16.

Unknown words are classified by word morphology and then passed in simplified form to the lexicon. For instance, 'preternaturally' doesn't occur in the lexicon, so it is classified as an '-ly' word and passed to the lexicon as "*LY*".

11. Classify unknown word={
if( has_numbers ){
    
Classify number
}

if( has_hyphen && has_letters ){
    
Classify hyphenation
}

Classify other

}
This macro is invoked in definition 16.

Words that contain numbers are classified as either numeric values (*NUM*) or ordinal values (*ORD*), each of which has a different grammatical function.

12. Classify number={
pos = word.find_first_not_of("-."+numbers);
if( pos == std::string::npos)
    return "*NUM*";

if( has_letters ){
    std::string::size_type m_pos = word.find_first_not_of(letters,pos);
    if( m_pos == std::string::npos)
        return "*ORD*";
}
pos = word.find_first_not_of("/:",pos);
if( pos != std::string::npos ){
    pos = word.find_first_not_of(numbers,pos);
    if( pos == std::string::npos)
        return "*NUM*";
}
}
This macro is invoked in definition 11.

Hyphenated words are usually either adjectives or nouns. And if the second part of the hyphenation is, itself, a word, then the term is probably an adjectival hyphenated word.

13. Classify hyphenation={
std::string::size_type lastPos = word.size()-2;
pos = word.find_last_of("-",lastPos);
if( pos > 0 &&
    word.find_last_of(letters+numbers,pos) == pos-1 &&
    word.find_first_of(letters+numbers,pos) == pos+1){
    if( LEXICON.count( word.substr(pos+1,word.size()-1-pos) ) ){
        return "*HYP-ADJ*";
    } else {
        return "*HYP*";
    }
}
}
This macro is invoked in definition 11.

For all other terms, word morphology is used to help identify how to classify them.

14. Classify other={
if( word.find_first_of("[({",0) != std::string::npos ){
    return "*LRB*";
} else if ( word.find_first_of("]})",0) != std::string::npos ){
    return "*RRB*";
} else if ( word.find_first_not_of(upper_case,0) != 0 &&
            word.find_first_not_of(upper_case+".-&",1) == std::string::npos ){
    return "*ABR*";
} else if ( word.find_first_of(letters+numbers,0) == std::string::npos ){
    return "*SYM*";
} else if ( word == upper ){
    return "*CAP*";
} else if ( word.size() >= 3 && word.substr(word.size()-3,3) == "ing" ){
    return "*ING*";
} else if ( word.substr(word.size()-1,1) == "s" ){
    return "*S*";
} else if ( word.size() >= 4 && word.substr(word.size()-4,4) == "tion" ){
    return "*TION*";
} else if ( word.size() >= 2 && word.substr(word.size()-2,2) == "ly" ){
    return "*LY*";
} else if ( word.size() >= 2 && word.substr(word.size()-2,2) == "ed" ){
    return "*ED*";
} else {
    return "*UNKNOWN*";
}
}
This macro is invoked in definition 11.

The assign_tag() method is a modified version of the Viterbi algorithm, used to identify the most likely tag for the given word.

15. Assign tag={
std::string best_tag( "NN" );
unsigned long best_so_far = 0;

LexiconValueMap &wordtags = LEXICON[word];
LexiconValueMap &hmmtags = HMM[previous_tag];

for( LexiconValueMap::iterator pos = wordtags.begin();
         pos != wordtags.end();
         ++pos ) {
    std::string tag = pos->first;
    if( hmmtags.count( tag ) ){
        unsigned long probability = hmmtags[tag] * ( pos->second + 1 );
        if( probability > best_so_far ){
            best_so_far = probability;
            best_tag = tag;
        }
    }
}
previous_tag = best_tag;
}
This macro is invoked in definition 16.


1.4. Extracting Terms

... To do ...


1.5. The File: tagger.hpp

16. File: semantic/tagger.hpp={
#ifndef __SEMANTIC_TAGGER_HPP__
#define __SEMANTIC_TAGGER_HPP__

includes

// STL Includes
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <map>
#include <set>

namespace semantic {
    class tagger {

        public:

/* *********************************************************************
 *        Generic constructor
 *********************************************************************** */
            
Tagger Constructor

/* *********************************************************************
 *        Separate words and punctuation in preparation for tagging
 *********************************************************************** */
            
Tokenizer

/* **********************************************************************
 *        Add tags to the given std::string,
 *        returns a tagged std::string
 * ********************************************************************** */
            
Add POS Tags

/* **********************************************************************
 *        Reset the internal HMM record of the previous POS tag
 * ********************************************************************** */
            void reset()
            {
                previous_tag = "PP";
            }

/* **********************************************************************
 *        Get a std::map of the given POS type from a std::string of text
 * ********************************************************************** */
            std::map<std::string,int> get_POS( const std::string& text,
                                               const std::set<std::string>& exps){

                std::vector<std::string> toks =  tokenize( text );
                std::map<std::string,int> terms;

                for( unsigned i = 0; i < toks.size(); ++i ){
                    std::string word = clean_word( toks[i] );
                    std::string tag = assign_tag( word );
                    if( exps.count(tag)){
                        terms[toks[i]]++;
                    }
                }
                return terms;
            }

/* **********************************************************************
 *        Get the proper noun phrases from the given text string
 * ********************************************************************** */
            std::map<std::string,int> get_proper_nouns( const std::string& text ){
                std::map<std::string,int> phrases = get_noun_phrases(text);
                std::map<std::string,int>::iterator pos;
                std::set<std::string> to_erase;
                for( pos = phrases.begin(); pos != phrases.end(); ++pos){
                    std::pair<
                        std::vector<std::string>,
                        std::vector<std::string>
                        > phrase = vectorize_tagged_text(pos->first);
                    std::vector<std::string> words = phrase.first;
                    std::vector<std::string> tags = phrase.second;

                    // erase any noun phrases that have lower case nouns and adjectives
                    for( std::vector<std::string>::size_type i = words.size() - 1;
                             i >= 0; --i ){
                        std::string t = simplify_tag(tags[i]);
                        if( ( t == "N" || t == "M" ) && is_lower(words[i]) ){
                            to_erase.insert(pos->first);
                            i = 0; // short circuit
                        }
                    }
                }
                std::set<std::string>::iterator epos;
                for( epos = to_erase.begin(); epos != to_erase.end(); ++epos ){
                    phrases.erase(*epos);
                }

                return phrases;
            }

/* **********************************************************************
 *        Get the maximal noun phrases from the given text string
 * ********************************************************************** */
            std::map<std::string,int> get_maximal_noun_phrases( const std::string& text ){
                std::vector<std::string> toks = tokenize( text );
                std::map<std::string,int> terms;
                std::string simplified;
                std::vector<std::string> words;
                std::vector<std::string> tags;
                for( unsigned i = 0; i < toks.size(); ++i ){
                    std::string tag = assign_tag( clean_word( toks[i] ) );
                    words.push_back( toks[i] );
                    tags.push_back( tag );
                    simplified += simplify_tag( tag );
                }

                std::vector<std::pair<std::string::size_type,
                                      std::string::size_type> > mnps;
                try {
                    mnps = find_mnp(simplified);
                } catch ( std::exception& e){
                    std::cerr << "Error finding Maximal Noun Phrase patterns: "
                              << e.what() << std::endl;
                }
                std::vector<std::pair<std::string::size_type,
                                      std::string::size_type> >::iterator pos;
                for( pos = mnps.begin(); pos != mnps.end(); ++pos ){
                    std::string term;
                    std::string::size_type start = pos->first;
                    std::string::size_type mnp_length = pos->second;
                    for( std::string::size_type i = start;
                             i < start + mnp_length; ++i){
                        term += words[i] + "/" + tags[i];
                        if( i < start+mnp_length-1 ){
                            term += " ";
                        }
                    }
                    terms[term]++;
                }
                return terms;
            }


/* **********************************************************************
 *        Get the noun phrases from the given text string
 * ********************************************************************** */
            std::map<std::string,int> get_noun_phrases( const std::string& text ){
                // get the maximal noun phrases

                std::map<std::string,int> terms;
                try {
                    terms = get_maximal_noun_phrases(text);
                } catch ( std::exception& e ){
                    std::cerr << "Error getting Maximal Noun Phrases: "
                              << e.what() << std::endl;
                }
                std::map<std::string,int> subphrases;
                std::map<std::string,int>::iterator pos;

                // loop through the MNPs and extract subphrases
                for( pos = terms.begin(); pos != terms.end(); ++pos ){

                    // first put the phrase string back into vectors
                    // and make a simplified version of the tag list
                    std::pair<std::vector<std::string>,
                              std::vector<std::string>
                        > words_and_tags = vectorize_tagged_text(pos->first);
                    std::vector<std::string> words = words_and_tags.first;
                    std::vector<std::string> tags = words_and_tags.second;
                    std::vector<std::string>::iterator vpos;
                    std::string tag_phrase;
                    for( vpos = tags.begin(); vpos != tags.end(); ++vpos){
                        tag_phrase += simplify_tag(*vpos);
                    }


                    // split up the noun phrase into subphrases by splitting
                    // on prepositions (P), articles (A), and numbers (D)
                    std::string::size_type ppos = tag_phrase.find_first_not_of("D",0);
                    std::string::size_type lastPpos = 0;
                    ppos = tag_phrase.find_first_of("PAD",ppos);

                        while( ppos != std::string::npos){
                        std::string subphrase = tag_phrase.substr(lastPpos,ppos-lastPpos);

                        // record resulting subphrase
                        std::string phrase;
                        for( unsigned i = 0; i < subphrase.size(); ++i ){
                            phrase += words[lastPpos+i] + "/" + tags[lastPpos+i];
                            if( i < subphrase.size() - 1){
                                phrase += " ";
                            }
                        }
                        subphrases[phrase]++;

                        // if the subphrase is longer than a single word, keep breaking
                        // it down by iteratively removing the leading word
                        if( subphrase.size() > 1){
                            for( unsigned spos = 0; spos < subphrase.size() -1; spos++){
                                // record first word, if it's a noun
                                if( subphrase.substr(spos,1) == "N" ){
                                    std::string term = words[lastPpos+spos] +
                                                       "/" + tags[lastPpos+spos];
                                    subphrases[term]++;
                                }
                                // record remainder of subphrase -- always a valid NP
                                std::string term;
                                for( unsigned i = 0; i < subphrase.size() - spos - 1; ++i ){
                                    term += words[lastPpos+spos+i+1] +
                                            "/" + tags[lastPpos+spos+i+1];
                                    if( i < subphrase.size() - spos - 1){
                                        term += " ";
                                    }
                                }
                                subphrases[term]++;
                            }
                        }
                        ppos = tag_phrase.find_first_not_of("PAD",ppos);
                        lastPpos = ppos;
                        ppos = tag_phrase.find_first_of("PAD",ppos);
                    }

                    // record the last subphrase -- this may be the same as the
                    // tag_phrase if there were no prepositions or articles

                    std::string subphrase = tag_phrase.substr(lastPpos,
                                                              tag_phrase.size()-lastPpos);
                    if( subphrase.size() < tag_phrase.size() ){
                        // record if different than original phrase
                        std::string phrase;
                        for( unsigned i = 0; i < subphrase.size(); ++i ){
                            phrase += words[lastPpos+i] + "/" + tags[lastPpos+i];
                            if( i < subphrase.size() - 1){
                                phrase += " ";
                            }
                        }
                        subphrases[phrase]++;

                    }

                    if( subphrase.size() > 1){
                        for( unsigned spos = 0; spos < subphrase.size() -1; spos++){
                            // record first word, if it's a noun
                            if( subphrase.substr(spos,1) == "N" ){
                                std::string term = words[lastPpos+spos] +
                                                   "/" + tags[lastPpos+spos];
                                subphrases[term]++;
                            }
                            // record remainder of subphrase
                            std::string term;
                            for( unsigned i = 0; i < subphrase.size() - spos - 1; ++i ){
                                term += words[lastPpos+spos+i+1] +
                                        "/" + tags[lastPpos+spos+i+1];
                                if( i < subphrase.size() - spos - 2){
                                    term += " ";
                                }
                            }
                            subphrases[term]++;
                        }
                    }
                }

                // add all the found subphrases onto the termlist
                for( pos = subphrases.begin(); pos != subphrases.end(); ++pos){
                    terms[pos->first] += pos->second;
                }
                return terms;
            }



/* **********************************************************************
 *        Remove POS tags from the std::map of terms
 * ********************************************************************** */
            std::map<std::string,int> remove_tags( const std::map<std::string,int>& wordmap ){
                std::map<std::string,int> terms;
                std::map<std::string,int>::const_iterator pos;

                for( pos = wordmap.begin(); pos != wordmap.end(); ++pos ){
                    std::string term(pos->first);
                    std::string::size_type tpos = term.find_last_of("/");
                    std::string::size_type lastPos = term.size();
                    while( tpos != std::string::npos ){
                        term.replace(tpos,lastPos-tpos,"");
                        tpos = term.find_last_of(" ",tpos-1);
                        lastPos = tpos;
                        tpos = term.find_last_of("/",tpos);
                    }
                    terms[term] += pos->second;
                }
                return terms;
            }


/* **********************************************************************
 *        Get the nouns from the given text string
 * ********************************************************************** */
            std::map<std::string,int> get_nouns( const std::string& text ){
                std::set<std::string> nouns;
                nouns.insert("NN");
                nouns.insert("NNP");
                nouns.insert("NNS");
                nouns.insert("NNPS");
                return get_POS( text, nouns );
            }


/* **********************************************************************
 *        Get the verbs from the given text string
 * ********************************************************************** */
            std::map<std::string,int> get_verbs( const std::string& text ){
                std::set<std::string> verbs;
                verbs.insert("VB");
                verbs.insert("VBD");
                verbs.insert("VBG");
                verbs.insert("VBN");
                verbs.insert("VBP");
                verbs.insert("VBZ");
                return get_POS( text, verbs );
            }


/* **********************************************************************
 *        Get the adjectives from the given text string
 * ********************************************************************** */
            std::map<std::string,int> get_adjectives( const std::string& text ){
                std::set<std::string> adjectives;
                adjectives.insert("JJ");
                adjectives.insert("JJS");
                adjectives.insert("JJR");
                return get_POS( text, adjectives );
            }








/*%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#
%#
#%            PRIVATE FUNCTIONS
%#
#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%*/
        private:
            std::string previous_tag;
            std::string letters;
            std::string upper_case;
            std::string numbers;



            typedef maps::ordered<std::string, unsigned long> LexiconValueMap;
            typedef maps::unordered<std::string, LexiconValueMap> LexiconMap;
            LexiconMap LEXICON;
            LexiconMap HMM;


            void init(){
                reset();
                numbers = "0123456789";
                upper_case = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
                letters = "abcdefghijklmnopqrstuvwxyz"+upper_case;
            }





/* **********************************************************************
 *        Find MNPs within the given tagged text
 * ********************************************************************** */
            std::vector<std::pair<std::string::size_type,
                                  std::string::size_type>
                            > find_mnp( const std::string tag_string ){
                std::vector<std::pair<std::string::size_type,
                                      std::string::size_type> > mnps;
                std::string::size_type pos = tag_string.find_first_of("N",0);
                while( pos != std::string::npos ){
                    std::string::size_type start = pos;
                    std::string::size_type end;
                    if( pos > 0 ){
                        std::string::size_type temp = tag_string.find_last_not_of("M",pos-1);
                        if( temp != std::string::npos && tag_string.substr(temp,1) == "D"){
                            start = temp;
                        } else {
                            start = temp + 1;
                        }
                    }

                    pos = tag_string.find_first_not_of("N",pos);
                    if( pos == std::string::npos ){
                        end = tag_string.size()-1;
                    } else {
                        end = pos - 1;
                    }
                    bool keep_going = true;
                    while( pos != std::string::npos &&
                           tag_string.substr(pos,1) != "X" &&
                           keep_going ){
                        pos = tag_string.find_first_not_of("P",pos);
                        std::string::size_type n = tag_string.find_first_not_of("A",pos) - pos;
                        if( n == 1 || n == 0 ){ // proceed
                            pos += n;
                            n = tag_string.find_first_not_of("D",pos) - pos;
                            if( n == 1 || n == 0 ){ // proceed
                                pos += n;
                                pos = tag_string.find_first_not_of("M",pos);
                                if( pos != std::string::npos &&
                                          tag_string.substr(pos,1) == "N"){ // found Noun head
                                    // go to end of N-head
                                    pos = tag_string.find_first_not_of("N",pos);
                                    if( pos == std::string::npos ){
                                        end = tag_string.size()-1;
                                    } else {
                                        end = pos - 1;
                                    }
                                } else {
                                    keep_going = false;
                                }
                            } else {
                                keep_going = false;
                            }
                        } else {
                            keep_going = false;
                        }
                    }
                    mnps.push_back( std::make_pair(start,end-start+1));
                    pos = tag_string.find_first_of("N",pos);
                }
                return mnps;
            }




/* **********************************************************************
 *        Used by `find_mnp`
 * ********************************************************************** */
            std::string simplify_tag( const std::string t ){
                std::string s;
                if( t.substr(0,2) == "NN"){
                    // noun
                    return "N";
                } else if( t.substr(0,2) == "JJ" || t == "VBN" || t == "VBG"){
                    // modifier
                    return "M";
                } else if( t == "IN" || t == "TO"){
                    // preposition
                    return "P";
                } else if( t == "DT"){
                    // article
                    return "A";
                } else if( t == "CD" ){
                    // number
                    return "D";
                } else {
                    // some other tag
                    return "X";
                }
            }


/* **********************************************************************
 *        Used by `tokenize`
 * ********************************************************************** */
            std::string clean_punctuation( const std::string& s )
            {

                std::string tok(s);

                std::string::size_type pos;

                try {
                    pos = tok.find_first_of("([{}])!?#$;~|%",0);
                    while( pos != std::string::npos ){
                        tok.insert(pos+1," ");
                        tok.insert(pos," ");
                        pos = tok.find_first_of("([{}])!?#$;~|%", pos+3);
                    }
                } catch ( std::exception &e ) {
                    std::cerr << "Parsing Error (1): " << e.what() << std::endl;
                }

                try {
                    pos = tok.find("--",0);
                    while( pos != std::string::npos ){
                        while( pos+1 != std::string::npos && tok.substr(pos+1,1) == "-"){
                            tok.replace(pos+1,1,"");
                        }
                        tok.insert(pos+1," ");
                        tok.insert(pos," ");
                        pos = tok.find("--",pos+3);
                    }

                } catch ( std::exception &e ) {
                    std::cerr << "Parsing Error (2): " << e.what() << std::endl;
                }

                try {

                    pos = tok.find("...",0);
                    while( pos != std::string::npos ){
                        int length = 3;
                        while( pos+length != std::string::npos && tok.substr(pos+length,1) == "."){
                            length++;
                        }
                        tok.replace(pos,length,"...");
                        tok.insert(pos+3," ");
                        tok.insert(pos," ");
                        pos = tok.find("...",pos+3);
                    }

                } catch ( std::exception &e ) {
                    std::cerr << "Parsing Error (3): " << e.what() << std::endl;
                }

                try {


                    pos = tok.find_first_of( "`", 0);
                    while( pos != std::string::npos ){
                        if( tok.substr(pos+1,1) == "`"){
                            tok.insert(pos+2," ");
                        } else {
                            tok.insert(pos+1," ");
                        }
                        tok.insert(pos," ");
                        pos = tok.find_first_of("`",pos+3);
                    }

                } catch ( std::exception &e ) {
                    std::cerr << "Parsing Error (4): " << e.what() << std::endl;
                }

                try {

                    pos = tok.find( "''", 0);
                    while( pos != std::string::npos ){
                        if( pos+2 != std::string::npos ){
                            tok.insert(pos+2," ");
                        }
                        tok.insert(pos," ");
                            pos = tok.find("''",pos+2);
                    }

                } catch ( std::exception &e ) {
                    std::cerr << "Parsing Error (5): " << e.what() << std::endl;
                }

                try {

                    pos = tok.find_first_of(",",0);
                    while( pos != std::string::npos ){
                        std::string::size_type m_pos = tok.find_first_of(numbers,pos);
                        if( pos+1 != std::string::npos && m_pos == pos+1){
                            pos = tok.find_first_of(",",pos+1);
                        } else {
                            tok.insert(pos+1," ");
                            tok.insert(pos," ");
                            pos = tok.find_first_of(",", pos+3);
                        }
                    }

                } catch ( std::exception &e ) {
                    std::cerr << "Parsing Error (6): " << e.what() << std::endl;
                }

                try {

                    pos = tok.find_first_of("\"",0);
                    while( pos != std::string::npos){
                        if( tok.find_first_of(letters,pos+1) != std::string::npos){
                            tok.replace(pos,1," `` ");
                        } else {
                            tok.replace(pos,1," '' ");
                        }
                        pos = tok.find_first_of("\"", pos+4 );
                    }
                } catch ( std::exception &e ) {
                    std::cerr << "Parsing Error (7): " << e.what() << std::endl;
                }



                    if( tok.substr(tok.size()-1,1) == ":"){
                        tok.insert(tok.size()-1," ");
                    }

                    if( tok.size() > 1 && tok.substr(tok.size()-1,1) == "."){
                        std::string::size_type m_pos = tok.find_last_not_of(letters,tok.size()-2);
                        if( m_pos < tok.size() - 2){
                            tok.insert(tok.size()-1," ");
                        }
                    }

                try {

                    pos = tok.find_first_of("'",0);
                    while( pos != std::string::npos){

                        std::string next_char;
                        std::string prev_char;
                        bool prev_is_alphaNum = false;
                        bool next_is_alphaNum = false;
                        if( pos < tok.size()-1){
                            next_char = tok.substr(pos+1,1);
                            if( next_char.find_first_of(letters+numbers,0) != std::string::npos)
                                next_is_alphaNum=true;

                        }

                        if( pos > 0){
                            prev_char = tok.substr(pos-1,1);
                            if( prev_char.find_first_of(letters+numbers,0) != std::string::npos)
                                prev_is_alphaNum=true;

                        }
                        if( prev_char == "."){
                            tok.insert(pos," ");
                            pos = tok.find_first_of("'",pos+2);
                        } else if( pos < tok.size()-1 && tok.substr(pos+1,1) == "'") {
                            // next char is a single quote (')
                            pos = tok.find_first_of("'",pos+2);
                        } else if( next_is_alphaNum && prev_is_alphaNum ){
                            // next and prev char is alpha-numeric
                            // is it a contraction?
                            if( prev_char == "n" && next_char == "t" ){
                                tok.insert(pos-1," ");
                                pos = tok.find_first_of("'",pos+3);
                            } else if ( next_char == "d" || next_char == "m" || next_char == "s" ){
                                if( pos == tok.size()-2 ||
                                    tok.find_first_not_of(letters+numbers,pos+2) == pos+2 ){
                                    tok.insert(pos," ");
                                    pos = tok.find_first_of("'",pos+3);
                                } else {
                                    pos = tok.find_first_of("'",pos+2);
                                }
                            } else if ( tok.find("'ve", pos)==pos ||
                                        tok.find("'ll",pos)==pos ||
                                        tok.find("'re",pos)==pos){
                                tok.insert(pos," ");
                                pos = tok.find_first_of("'",pos+4);
                            } else {
                                pos = tok.find_first_of("'",pos+1);
                            }

                        } else if ( next_is_alphaNum && !prev_is_alphaNum ){ // begins word
                            tok.replace(pos,1," ` ");
                            pos = tok.find_first_of("'",pos+3);
                        } else if ( !next_is_alphaNum && prev_is_alphaNum ){ // ends word
                            tok.replace(pos,1," ' ");
                            pos = tok.find_first_of("'",pos+3);
                        } else {
                            pos = tok.find_first_of("'",pos+1);
                        }

                    }
                } catch ( std::exception &e ) {
                    std::cerr << "Parsing Error (8): " << e.what() << std::endl;
                }
                boost::algorithm::trim(tok);
                return tok;

            }




/* **********************************************************************
 *        Shortcut for taking a tagged string and turning it
 *        back into two vectors of words and tags
 * ********************************************************************** */
            std::pair<std::vector<std::string>,
                      std::vector<std::string>
                      > vectorize_tagged_text(const std::string tagged){
                std::vector<std::string> words;
                std::vector<std::string> tags;
                std::string::size_type word_start = 0;
                std::string::size_type tag_end = tagged.find_first_of(" ",0);
                std::string::size_type divider;
                while( tag_end != std::string::npos ){
                    divider = tagged.find_last_of("/",tag_end);
                    words.push_back(tagged.substr(word_start,divider-word_start));
                    tags.push_back(tagged.substr(divider+1,tag_end-divider-1));
                    word_start = tagged.find_first_not_of(" ",tag_end);
                    tag_end = tagged.find_first_of(" ",word_start);
                }
                divider = tagged.find_last_of("/",tagged.size());
                words.push_back(tagged.substr(word_start,divider-word_start));
                tags.push_back(tagged.substr(divider+1,tagged.size()-divider-1));
                return std::make_pair(words,tags);
            }




/* *********************************************************************
 *        Load the lexicon into a 2-dimensional std::map
 *********************************************************************** */
            void load_lexicon( const std::string& filename="" )
            {
                if (!filename.empty()) {
                    if (try_loading_lexicon(filename)) return;
                    if (try_loading_lexicon(LEXICON_INSTALL_LOCATION)) return;

                    std::cerr << "Couldn't load lexicon from either "
                              << filename << " or "
                              << LEXICON_INSTALL_LOCATION << ".\n";
                    exit(EXIT_FAILURE);
                }

                if (try_loading_lexicon(LEXICON_INSTALL_LOCATION)) return;
                std::cerr << "Couldn't load lexicon from "
                          << LEXICON_INSTALL_LOCATION << ".\n";
                exit(EXIT_FAILURE);
            }

            bool try_loading_lexicon( const std::string &filename ) {
                std::ifstream file;
                file.open(filename.c_str(), std::ios_base::in | std::ios_base::binary);
                if( file ){
                    load_lexicon(file);
                    file.clear();
                    file.close();
                    return true;
                }
                return false;
            }

            void load_lexicon( std::istream& lex_stream ){
                std::string lex = "hmm";
                std::string line;

                while (std::getline(lex_stream,line)) {

                    if( line.compare(0,10,"## Lexicon") == 0 ){
                        lex = "lex";
                    } else if ( line.compare(0,22,"## Hidden Markov Model") == 0){
                        lex = "hmm";
                    }

                    if ( line.compare(0,1,"#") != 0 ) {
                        std::string::size_type pos = line.find_first_of(" ", 0);
                        std::string word = line.substr(0,pos-1);
                        std::string delim = " {},:";
                        std::string::size_type lastPos = line.find_first_not_of(delim,pos);
                        pos = line.find_first_of(delim,lastPos);


                        while( pos != std::string::npos || lastPos != std::string::npos ){

                            // POS TAG
                            std::string tag = line.substr(lastPos,pos-lastPos);
                            lastPos = line.find_first_not_of(delim,pos);
                            pos = line.find_first_of(delim,lastPos);

                            // Frequency
                            unsigned long value;
                            std::string num = line.substr(lastPos,pos-lastPos);
                            value = strtoul( num.c_str(), NULL, 10 );
                            lastPos = line.find_first_not_of(delim,pos);
                            pos = line.find_first_of(delim,lastPos);

                            // Record
                            if( lex == "lex" ){
                                LEXICON[word][tag] = value;
                            } else if ( lex == "hmm" ) {
                                HMM[word][tag] = value;
                            }
                        }

                    }
                }
            }




/* **********************************************************************
 *        Viterbi Algorithm --> Bayesian logic applied to values in lexica
 * ********************************************************************** */
            std::string assign_tag( const std::string& word )
            {
                if( word == "*SYM*")
                    return "SYM";

                
Assign tag
                return best_tag;
            }

/*        A slightly slower version of the above     */
            std::string assign_tag_slower( const std::string& word )
            {
                if( word == "*SYM*")
                    return "SYM";

                std::string best_tag( "NN" );
                unsigned long best_so_far = 0;

                LexiconValueMap &wordtags = LEXICON[word];
                LexiconValueMap &hmmtags = HMM[previous_tag];

                for(LexiconValueMap::iterator pos = hmmtags.begin(); pos != hmmtags.end(); ++pos) {
                    std::string tag = pos->first;
                    if( wordtags.count( tag ) ){
                        unsigned long probability = pos->second * ( wordtags[tag] + 1 );
                        if( probability > best_so_far ){
                            best_so_far = probability;
                            best_tag = tag;
                        }
                    }
                }
                previous_tag = best_tag;
                return best_tag;
            }




/* *********************************************************************
 *        Tell me whether this is an upper/lower case word or not
 *********************************************************************** */
            bool is_upper( const std::string& s ){
                if( s.size() > 0 &&
                    boost::algorithm::all(s.substr(0,1), boost::algorithm::is_upper()) ){
                    return true;
                } else {
                    return false;
                }
            }

            bool is_lower( const std::string& s ){
                if( s.size() > 0 &&
                    boost::algorithm::all(s.substr(0,1), boost::algorithm::is_lower()) ){
                    return true;
                } else {
                    return false;
                }
            }

            bool is_alpha( const std::string& s ){
                if( s.size() > 0 &&
                    boost::algorithm::all(s.substr(0,1), boost::algorithm::is_alpha()) ){
                    return true;
                } else {
                    return false;
                }
            }



/* *********************************************************************
 *        Decide what form of the word to analyze
 *********************************************************************** */
            std::string clean_word( const std::string& word )
            {

                std::string lower(word);
                std::transform( word.begin(), word.end(), lower.begin(), tolower );

                // if it looks like this word starts a sentence, try lower case first
                if( previous_tag == "PP" && is_upper(word) ){
                    // Does the word exist as a lower case word?
                    if( LEXICON.count( lower ) && word != "I" ){
                        // NOTE: it would be better to have an actual comparison count
                        // here -- use lower if more common than upper...
                        return lower;
                    }
                }

                // Does the word exist in the lexicon as-is?
                if( LEXICON.count( word ) ){
                    return word;
                }

                // Does the word exist as a lower case word?
                if( LEXICON.count( lower ) ){
                    return lower;
                }


                // Otherwise, classify by word morphology
                return classify_unknown_word( word );

            }



/* *********************************************************************
 *        Classify unknown words according to word morphology
 *********************************************************************** */
            std::string classify_unknown_word( const std::string& word )
            {

                std::string upper(word);
                upper[0] = toupper(word[0]);

                std::string::size_type pos;

                bool has_letters = false;
                bool has_numbers = false;
                bool has_hyphen = false;

                if( word.find_first_of("-",0) != std::string::npos)
                    has_hyphen = true;

                if( word.find_first_of(letters,0) != std::string::npos)
                    has_letters = true;

                if( word.find_first_of(numbers,0) != std::string::npos)
                    has_numbers = true;

                
Classify unknown word
            }
    }; // class Tagger
} // namespace semantic

#endif


}
This macro is attached to an output file.


2. Document Parsing

The Semantic Engine's document parser is a wrapper around the POS tagger with a few added functions built in, such as stemming, and term-based filtering. The parser will process a text, extract the relevant terms and frequencies and filter the results. If terms are stemmed, it also keeps track of the most frequently appearing unstemmed version.


2.1. Invoking the Parser

The text_parser is invoked exactly like the tagger class. It can be given the location of a POS lexicon, the entire lexicon as an istream, or no parameter, in which case it will default to the English lexicon installed with the Semantic Engine toolkit.

17. Parser Constructor={
text_parser( const std::string &lexicon_location="" )
    : parser(lexicon_location) {
    enable_stemming = true;
}
text_parser( std::istream& lexicon_stream )
    : parser( lexicon_stream ){
    enable_stemming = true;
}
}
This macro is invoked in definition 22.

Once the parser is invoked, the user can set which parsing method to use, that is, what grammatical constructs should be extracted as terms. This defaults to 'nouns', but other options include: 'noun_phrases', 'proper_nouns', 'maximal_noun_phrases', 'verbs' and 'adjectives'.

18. Set Parsing Method={
void set_parsing_method( const std::string& pattern){
    POS_pattern = pattern;
}
}
This macro is invoked in definition 22.


2.2. Word Filters

Once a text is parsed, each word is passed through a regimen of filters. The user can add an arbitrary number of filters: the filters implemented in the Semantic Engine are described in the filters.hpp documentation.

If any word fails to pass a filter test, it is added to a vector<string> of items that will be removed from the map<string,int> of terms.

19. Word Filters={

for( std::vector<WordPtr>::iterator ptr = word_filters.begin();
         ptr != word_filters.end(); ++ptr ){
     WordPtr filter = *ptr;

    std::vector<std::string> to_erase;  // list of terms to erase
    for( std::map<std::string,int>::iterator i = terms.begin();
             i != terms.end(); ++i ){
        std::string key = i->first;
        if( ! (*filter)(key) ){ // check filter, erase if false
            to_erase.push_back( key );
        }
    }

    // actually do the deletions
    for( std::vector<std::string>::iterator i = to_erase.begin();
             i != to_erase.end(); ++i ){
        terms.erase(*i);
    }
}
}
This macro is invoked in definition 22.

Word filters can be added simply by passing them to the add_word_filter() method. A pointer is added to the word_filters vector<WordPtr> and the filter is applied during parsing.

20. Add Word Filter={
template <class Filter>
void add_word_filter(Filter f) {
    word_filters.push_back( WordPtr( new Filter(f) ) );
}
}
This macro is invoked in definition 22.


2.3. Stemming and Termlists

Word stemming can be turned on or off. It is enabled by default and it controls whether the terms returned by the parser are fully stemmed or not.

21. Set Stemming={
void set_stemming(bool val){
    enable_stemming = val;
}
}
This macro is invoked in definition 22.

The stemmer employs the Porter stemmer

If terms are stemmed, the parser provides a facility for retrieving the unstemmed version of each stemmed word. By passing a map<string,map<string,int>> as a second parameter to the parser, the data structure will grow as the parsing proceeds. The data is keyed to the stemmed terms, each of which point to a map<string,int> of each original term and occurrence count.


2.4. The File: parsing.hpp

22. File: semantic/parsing.hpp={
#ifndef __SEMANTIC_PARSING_HPP__
#define __SEMANTIC_PARSING_HPP__

#include <semantic/tagger.hpp>
#include <semantic/stem/english_stem.h>
#include <semantic/filter.hpp>
#include <boost/shared_ptr.hpp>

#include <map>
#include <vector>
#include <string>
#include <cctype>
#include <iostream>

namespace semantic {

/* *******************************************************
        TEXT PARSER
   ******************************************************* */
    class text_parser {

        public:
            typedef std::map<std::string,int> UnstemmedCount;

            
Parser Constructor

/* ************************************************************************
    set_parsing_method( std::string ) -- set how the POS tagger gets words
            the value can be:     'noun_phrases', 'adjectives', 'verbs',
                                'proper_nouns', or 'nouns' (default)
   ************************************************************************ */
            
Set Parsing Method


/* ************************************************************************
    parse ( stream ) -- parse a text stream, returning stemmed terms
   ************************************************************************ */
            std::map<std::string,int> parse( std::istream& instream ){
                std::map<std::string,UnstemmedCount> unstemmed;
                return parse( instream, unstemmed );
            }

/* ************************************************************************
    parse ( stream, unstemmed_lookup ) -- parse a text stream, returning stemmed terms
   ************************************************************************ */
            std::map<std::string,int> parse(
                           std::istream& instream,
                           std::map<std::string,UnstemmedCount>& unstemmed ){
                std::map<std::string,int> words;
                std::string line;
                while( std::getline(instream,line) ){
                    std::map<std::string,int> mywords = parse( line, unstemmed );
                    std::map<std::string,int>::iterator pos;
                    for( pos = mywords.begin(); pos != mywords.end(); ++pos ){
                        words[pos->first] += pos->second;
                    }
                }
                return words;
            }

/* ************************************************************************
    parse ( string ) -- parse a text string, returning stemmed terms
   ************************************************************************ */
            std::map<std::string,int> parse( const std::string& intext ){
                std::map<std::string,UnstemmedCount> unstemmed;
                return parse( intext, unstemmed );
            }


/* ************************************************************************
    parse ( string, unstemmed_lookup ) -- parse a text string, returning
                                          stemmed terms, keeping original
                                          terms in the unstemmed_lookup
   ************************************************************************ */
            std::map<std::string,int> parse(
                          const std::string& intext,
                          std::map<std::string,UnstemmedCount>& unstemmed ) {

                // POS tag
                std::map<std::string,int> terms;

                try {
                    if( POS_pattern == "noun_phrases"){
                        terms = parser.get_noun_phrases( intext );
                    } else if ( POS_pattern == "adjectives"){
                        terms = parser.get_adjectives( intext );
                    } else if ( POS_pattern == "maximal_noun_phrases"){
                        terms = parser.get_maximal_noun_phrases( intext );
                    } else if ( POS_pattern == "proper_nouns") {
                        terms = parser.get_proper_nouns( intext );
                    } else if ( POS_pattern == "verbs") {
                        terms = parser.get_verbs( intext );
                    } else {
                        terms = parser.get_nouns( intext );
                    }
                    terms = parser.remove_tags( terms );

                } catch ( std::exception& e ){
                    std::string POS = POS_pattern;
                    if( POS.length() == 0 ){
                        POS = "nouns";
                    }
                    std::cerr << "Error getting " << POS << ": " << e.what() << std::endl;
                }

                // pass through a word filter
                
Word Filters

                return stem ( terms, unstemmed );
            }




/* ************************************************************************
    add_word_filter ( class word_filter ) -- add a word-based filter
                                             to the parser
   ************************************************************************ */
            
Add Word Filter

            
Set Stemming

        private:
            typedef boost::shared_ptr<word_filter> WordPtr;

            tagger parser;
            bool enable_stemming;
            std::string POS_pattern;
            std::vector<WordPtr> word_filters;
            std::map<std::string,int> unstemmed;


/* ************************************************************************
    stem ( terms, unstemmed_lookup ) -- stem the words in `terms` map
                                        and keep the original word forms
                                        in the `unstemmed_lookup`
   ************************************************************************ */
            std::map<std::string,int> stem(
                        const std::map<std::string,int>& terms,
                        std::map<std::string,UnstemmedCount>& unstemmed ){

                stemming::english_stem Stemmer;
                std::map<std::string,int> stemmed;
                std::map<std::string,int>::const_iterator pos;
                for( pos = terms.begin(); pos != terms.end(); ++pos ){
                    std::string word( pos->first );
                    std::string original(word);
                    if( enable_stemming ){
                        Stemmer(word);
                    }
                    if( word.length() > 0 ){

                        // Make lower case
                        std::string lower(word);
                        std::transform(word.begin(),word.end(),lower.begin(),tolower);

                        stemmed[lower] += pos->second;
                        unstemmed[lower][original]++;
                        unstemmed[lower]["__count"]++;
                    }
                }
                return stemmed;

            }




    };




}

#endif

}
This macro is attached to an output file.


3. Document Indexing


3.1. The File: indexing.hpp

23. File: semantic/indexing.hpp={#ifndef __SEMANTIC_INDEXING_HPP__
#define __SEMANTIC_INDEXING_HPP__

/*
classes to help with indexing texts (or other things) into a Semantic Graph
*/



#include <semantic/semantic.hpp>
#include <semantic/parsing.hpp>
#include <semantic/filter.hpp>
#include <semantic/file_reader.hpp>

#include <map>
#include <sstream>
#include <string>
#include <cctype>
#include <utility>
#include <iostream>


using std::string;

namespace semantic {

    template <class Graph>
    class text_indexing_helper {
        typedef se_graph_traits<Graph> graph_traits;
        typedef typename graph_traits::vertex_descriptor Vertex;
        typedef typename graph_traits::edge_descriptor Edge;
        public:
            explicit text_indexing_helper(Graph &graph) : g(graph) {}

            std::pair<Edge, Edge> add_doc_term_edge(const string& doc_string,
                                                    const string& term_string,
                                                    unsigned int strength) {
                Vertex doc, term;
                doc = add_doc_vertex(doc_string);
                term = add_term_vertex(term_string);

                typename graph_traits::edge_properties_type eprop;
                eprop.strength = strength;

                return std::make_pair(add_edge(doc, term, eprop, g).first,
                                      add_edge(term, doc, eprop, g).first);
            }

            Edge add_single_doc_term_edge(const string& doc_string,
                                          const string& term_string,
                                          unsigned int strength) {
                Vertex doc, term;
                doc = add_doc_vertex(doc_string);
                term = add_term_vertex(term_string);

                typename graph_traits::edge_properties_type eprop;
                eprop.strength = strength;

                return add_edge(doc, term, eprop, g).first;
            }

            Graph &graph() { return g; }
            const Graph &graph() const { return g; }


        protected:
            Graph &g;

        private:
            Vertex add_term_vertex(string term) {
                if (term.find_first_of(' ') != std::string::npos) return add_phrase_vertex(term);
                return do_vertex(term, node_type_major_term, node_type_minor_term, term_cache);
            }

            Vertex add_phrase_vertex(string phrase){
                return do_vertex(phrase, node_type_major_term, node_type_minor_phrase, term_cache);
            }

            Vertex add_doc_vertex(string doc) {
                return do_vertex(doc, node_type_major_doc, node_type_minor_undef, doc_cache);
            }

            Vertex do_vertex(string content,
                             int type_major,
                             int type_minor,
                             std::map<string, Vertex> &cache) {
                if (cache.count(content)) return cache[content];
                typename graph_traits::vertex_properties_type vprop;
                vprop.content = content;
                vprop.type_major = type_major;
                vprop.type_minor = type_minor;

                Vertex v = add_vertex(vprop, g);
                cache.insert(std::make_pair<string, Vertex>(content, v));
                return v;
            }

            std::map<string, Vertex> doc_cache;
            std::map<string, Vertex> term_cache;

    };

    template <class Graph>
    class text_indexer: public text_indexing_helper<Graph> {
        typedef std::map<std::string,int> UnstemmedCount;
        typedef text_indexing_helper<Graph> base_type;

        public:
/* **************************************************** *
 *        CONSTRUCTOR
 *
 *        text_indexer<Graph> ( Graph, lexicon_filename or lexicon_istream )
 * **************************************************** */
            text_indexer(Graph & graph, const std::string& lexicon_location="")
                     : base_type(graph), parser(lexicon_location)
            {
                init();
            }
            text_indexer(Graph & graph, std::istream& lexicon_stream)
                     : base_type(graph), parser(lexicon_stream)
            {
                init();
            }

/* **************************************************** *
 *        METHODS
 *
 *        add_word_filter ( class Filter )
 * **************************************************** */
            template <class Filter>
            void add_word_filter( Filter f ){
                parser.add_word_filter( f );
            }

            void set_parsing_method(const std::string& method){
                parser.set_parsing_method(method);
            }

/* **************************************************** *
 *        index ( filename, filestream, [mime_type], [weight=1] )
 * **************************************************** */
             std::string index( const std::string& filename,
                                std::istream& filestream,
                                const std::string& mime_type,
                                const int multiplier=1 )
            {
                file_reader reader;
                if( default_encoding.size() > 0 )
                    reader.set_default_encoding( default_encoding );

                std::string text = reader( filestream, mime_type );
                smart_quotes_filter filter;
                text = filter(text);

                if( text.size() > 10 ){
                    add_to_index( filename, text, multiplier );
                }
                return text;
            }

            std::string index( const std::string& filename,
                               std::istream& filestream,
                               int multiplier=1 )
            {
                std::string mime_type = get_mime_type_from_filename( filename );
                return index( filename, filestream, mime_type, multiplier );
            }





/* **************************************************** *
 *        index ( filename, [weight=1] )
 * **************************************************** */
            std::string index( const std::string& filename, const int multiplier=1 )
            {
                file_reader reader;
                if( default_encoding.size() > 0 )
                    reader.set_default_encoding( default_encoding );

                if( pdfLayout.size() && pdfLayout != "layout")
                    reader.set_pdfLayout( pdfLayout );

                std::string text = reader( filename );
                if( text.size() > 10 ){
                    add_to_index( filename, text, multiplier );
                }
                return text;
            }

/* **************************************************** *
 *        index ( doc_id, text, [weight=1] )
 * **************************************************** */
            std::string index( const std::string& doc_id,
                               const std::string& text,
                               const int multiplier=1){
                std::string m_text;
                std::string m_encoding;
                if ( default_encoding.size() > 0 ){
                    m_encoding = default_encoding;
                } else {
                    m_encoding = "iso-8859-1";
                }
                if( m_encoding != "utf8"){
                    m_text = convert_to_utf8(text,m_encoding);
                } else {
                    m_text = text;
                }
                smart_quotes_filter filter;
                add_to_index(doc_id, filter(m_text), multiplier);
                return m_text;
            }



/* **************************************************** *
 *        unindex ( doc_id )
 * **************************************************** */
            void unindex( const std::string& id ){
                typename se_graph_traits<Graph>::vertex_descriptor u =
                        base_type::g.vertex_by_id(
                            base_type::g.fetch_vertex_id_by_content_and_type(
                                id, node_type_major_doc
                            ) );
                clear_vertex(u,base_type::g);
            }


/* **************************************************** *
 *        reindex ( filename, [weight=1] )
 * **************************************************** */
            std::string reindex( const std::string& filename, const int multiplier=1){
                file_reader reader;
                if( default_encoding.size() > 0 )
                    reader.set_default_encoding( default_encoding );

                if( pdfLayout.size() && pdfLayout != "layout")
                    reader.set_pdfLayout( pdfLayout );

                std::string text = reader( filename );
                std::string existing_text =
                        base_type::g.get_vertex_meta_value(
                            base_type::g.vertex_by_id(
                                base_type::g.fetch_vertex_id_by_content_and_type(
                                    filename, node_type_major_doc
                                ) ), "body");
                if( existing_text != text ){
                    unindex(filename);
                    index( filename, text, multiplier );
                }
                return text;
            }


/* **************************************************** *
 *        reindex ( doc_id, text, [weight=1] )
 * **************************************************** */
            std::string reindex( const std::string& doc_id,
                                 const std::string& text,
                                 const int multiplier=1){
                std::string existing_text =
                        base_type::g.get_vertex_meta_value(
                            base_type::g.vertex_by_id(
                                base_type::g.fetch_vertex_id_by_content_and_type(
                                    doc_id, node_type_major_doc
                                ) ), "body");
                if( existing_text != text ){
                    unindex(doc_id);
                    index( doc_id, text, multiplier );
                }
                return text;
            }


/* **************************************************** *
 *        prune_wordlist(min=2)
 *
 *        This will prune out less common unstemmed variations of each word
 *        Any word occurring less often than 'min' will be omitted from the
 *        data structure
 * **************************************************** */
            std::map<std::string,
                     std::pair<std::string,int> > prune_wordlist(const int min_occurrence=2){

                std::map<std::string, std::pair<std::string,int> > new_wordlist;
                typedef std::multimap< int, std::string, std::greater<int> > RevMap;

                std::multimap<std::string,UnstemmedCount>::iterator pos;
                for( pos = wordlist.begin(); pos != wordlist.end(); ++pos ){
                    std::string stemmed = pos->first;
                    UnstemmedCount unstemmed = pos->second;

                    RevMap lookup;
                    int total = wordlist[stemmed]["__count"];
                    UnstemmedCount::iterator cpos;
                    for( cpos = unstemmed.begin(); cpos != unstemmed.end(); ++cpos ){
                        if( cpos->first != "__count" && total >= min_occurrence ){
                            lookup.insert( make_pair( cpos->second, cpos->first ) );
                        }
                    }
                    if( lookup.size() >= 1 ){
                        std::string word = lookup.begin()->second;
                        std::string lower(word);
                        std::transform(word.begin(),word.end(),lower.begin(),tolower);

                        // skip words that don't have differing stems
                        if( lower != stemmed ){
                            new_wordlist.insert(
                                std::make_pair(stemmed, std::make_pair(word, total) ) );
                        }
                    }
                }
                return new_wordlist;
            }

/* **************************************************** *
 *        get_unstemmed_words()
 * **************************************************** */
            std::map<std::string,UnstemmedCount> get_unstemmed_words(const int max=2)
            {
                manage_wordlist(max);
                return wordlist;
            }


/* **************************************************** *
 *        serialize_wordlist(min=2)
 *
 *        this method takes the wordlist, prunes out the less
 *        commonly occurring variations of words and puts
 *        them into a long string in this format --- stem,word:count;
 *        any words occurring less often than the 'min' value are excluded
 *
 * **************************************************** */


            std::string serialize_wordlist(const int min_occurrence=2){
                std::string serialization;
                std::map<std::string,
                         std::pair<std::string,int>
                    > my_wordlist = prune_wordlist(min_occurrence);
                std::map<std::string, std::pair<std::string, int> >::iterator pos;
                for( pos = my_wordlist.begin(); pos != my_wordlist.end(); ++pos ){
                    std::string stem = pos->first;
                    std::pair<std::string,int> termCount = pos->second;

                    std::string word = termCount.first;
                    int count = termCount.second;
                    if( stem.find_first_of(",:;") == std::string::npos &&
                            word.find_first_of(",:;") == std::string::npos ){
                        // any phrases occurring only once get omitted.
                        if( stem.find_first_of(" ") == std::string::npos || count > 1 ){
                            std::ostringstream oss;
                            oss << count;
                            serialization.append(stem+","+word+":"+oss.str()+";");
                        }
                    }
                }
                return serialization;
            }

            std::string get_collection_value(const std::string key,
                                             const std::string default_value){
                return base_type::g.get_meta_value(key, default_value);
            }

            void set_collection_value(const std::string key, const std::string& value){
                base_type::g.set_meta_value(key,value);
            }

/* **************************************************** *
 *         finish( min=2 )
 * **************************************************** */
            bool finish(int min=2){

                try {
                    base_type::g.commit_changes_to_storage();
                } catch ( std::exception &e){
                    std::cerr << "Error Indexing to Database: " << e.what() << std::endl;
                    return false;
                }

                try {
                    std::string wordlist = serialize_wordlist(min);
                    base_type::g.set_meta_value("wordlist", wordlist);
                }    catch ( std::exception &e ) {
                    std::cerr << "Error getting wordlist" << e.what() << std::endl;
                }


                std::map<std::string,std::string>::iterator pos;
                if( storeText ){
                    for( pos = text_store.begin(); pos != text_store.end(); ++pos ){
                        std::string filename = pos->first;
                        std::string text = pos->second;

                        typename se_graph_traits<Graph>::vertex_descriptor u;
                        try {
                            u = base_type::g.vertex_by_id(
                                    base_type::g.fetch_vertex_id_by_content_and_type(
                                        filename, node_type_major_doc));
                            // attach the text
                            base_type::g.set_vertex_meta_value(u, "body", text);
                        } catch ( std::exception &e){
                            //std::cerr << "Error: " << e.what() << std::endl;
                            continue;
                        }
                    }
                }
                return true;
            }


/* **************************************************** *
 *        set_default_encoding( std::string& encoding )
 * **************************************************** */
            void set_default_encoding( const std::string& encoding )
            {
                default_encoding = encoding;
            }

            void set_pdf_layout( const std::string& layout){
                pdfLayout = layout;
            }

            void set_stemming(bool val){
                parser.set_stemming(val);
            }

            void store_text(bool val){
                storeText = val;
            }
            std::string pdfLayout;
            std::map<std::string,std::string> text_store;


        private:
            std::map<std::string,UnstemmedCount> wordlist;
            text_parser parser;
            std::string default_encoding;
            int files_indexed;
            bool storeText;

            void init(){
                pdfLayout = "layout";
                files_indexed = 0;
                storeText = true;
                text_store.clear();
            }

            void add_to_index( const std::string& doc_id,
                               const std::string& text,
                               const int multiplier )
            {
                files_indexed++;
                //std::cout << "adding: " << doc_id << " => " << text << std::endl;
                if( storeText ){
                    text_store[doc_id] = text;
                }
                std::map<std::string,int> terms = parser.parse( text, wordlist );
                std::map<std::string,int>::iterator tpos;
                std::string value = base_type::g.get_meta_value("doc_min","1");
                int min = atoi(value.c_str());
                for( tpos = terms.begin(); tpos != terms.end(); ++tpos ){
                    std::string term = tpos->first;
                    if( tpos->second >= min ){
                        unsigned int weight = tpos->second * multiplier;
                        // std::cout << "adding: " << doc_id << " - " << term << std::endl;
                        base_type::add_doc_term_edge( doc_id, term, weight );
                    }
                }

                if( !(files_indexed % 5000 ) ){
                    manage_wordlist();
                }
            }



            void manage_wordlist(const unsigned int max=3)
            {
                typedef std::multimap<int,std::string,std::less<int> > RevMap;
                std::multimap<std::string,std::string> to_delete;
                std::multimap<std::string,UnstemmedCount>::iterator pos;
                for( pos = wordlist.begin(); pos != wordlist.end(); ++pos ){
                    std::string stemmed = pos->first;
                    UnstemmedCount unstemmed = pos->second;

                    RevMap lookup;
                    if( unstemmed.size() > max + 1 ){
                        UnstemmedCount::iterator cpos;
                        for( cpos = unstemmed.begin(); cpos != unstemmed.end(); ++cpos ){
                            if( cpos->first != "__count" ){
                                lookup.insert( make_pair( cpos->second, cpos->first ) );
                            }
                        }
                        RevMap::iterator spos;
                        unsigned int i = lookup.size();
                        for( spos = lookup.begin(); spos != lookup.end(); ++spos ){
                            if( i-- > max ){
                                to_delete.insert( make_pair( stemmed, spos->second ) );
                            }
                        }
                    }
                }
                std::map<std::string,std::string>::iterator dpos;
                for( dpos = to_delete.begin(); dpos != to_delete.end(); ++dpos ){
                    wordlist[dpos->first].erase(dpos->second);
                }
            }

/*
            RevMap lookup;
            int total = wordlist[stemmed]["__count"];
            UnstemmedCount::iterator cpos;
            for( cpos = unstemmed.begin(); cpos != unstemmed.end(); ++cpos ){
                if( cpos->first != "__count" && total >= min_occurrence ){
                    lookup.insert( make_pair( cpos->second, cpos->first ) );
                }
            }
            if( lookup.size() >= 1 ){
                std::string word = lookup.begin()->second;
                std::string lower(word);
                std::transform(word.begin(),word.end(),lower.begin(),tolower);

                // skip words that don't have differing stems
                if( lower != stemmed ){
                    new_wordlist.insert( std::make_pair(stemmed, std::make_pair(word, total) ) );
                }
            }
*/


            std::string get_mime_type_from_filename( const std::string& filename )
            {
                std::string::size_type pos = filename.find_last_of(".");
                std::string ext = filename.substr(pos+1,filename.size()-pos-1);
                if( ext == "doc" ){
                    return "application/msword";
                } else if ( ext == "rtf" ){
                    return "application/rtf";
                } else if ( ext == "html" || ext == "htm" ){
                    return "text/html";
                } else if ( ext == "txt" ){
                    return "text/plain";
                } else {
                    std::cerr << "Couldn't determine MIME type from filename: "
                              << filename << std::endl
                              << "Trying \"text/plain\"..." << std::endl;
                    return "text/plain";
                }
            }



    };


} // namespace semantic



#endif



}
This macro is attached to an output file.


End Of File