|
Semantic Engine C++ API
|
|
Indexing Classes
|
|
By Aaron Coburn
|
| This file describes the Semantic Engine indexing classes, used to create a data structure sufficient for modeling large text collections. Once this model is created, it can be queried and analyzed using graph-theoretic operations. |
|
The Semantic Engine indexer identifies valid terms from each document by using only certain classes of words. Typically, this includes only nouns, but facilities are in place to identify noun phrases, verbs, adjectives and proper nouns. The part of speech of words is significant in building an index because it obviates the need for stoplists. Instead, word categories can be excluded: pronouns and prepositions, for instance, carry very little helpful semantic information for an indexer. Semantic richness usually lies with nouns, and especially with noun phrases. So in a given document most words are not indexed at all; this reduces both the size of the index and the 'noise' of irrelevant words. To do this, the indexer uses a Part-of-speech tagger. The tagger first tokenizes the document, separating words from punctuation. Sequentially, it then classifies each word into one of 44 part of speech categories. These categories are the same as are used in the Penn Treebank Project. Using data relating to the frequency of certain POS tags occurring with the words in a text corpus, the tagger builds a probabilistic model for the likelihood that a given word is a certain part of speech. The most likely tag is given to the word. This allows polysemous words, such as 'fly' to be distinguished: one may wish to keep the noun forms but discard the instances of it as a verb. Unknown words -- words not occurring in the lexicon -- are classified according to word morphology. For instance, words ending with an -ly will likely be adverbs, while words ending with an -s are probably either plural nouns or 3rd person singular verbs. Once all the words are tagged, terms and term counts are extracted from the tagged text and given to the user for further processing. |
|
There are two ways to call the tagger, both of which require a lexical file formatted in a particular way. A full lexicon for English is installed as a standard part of the Semantic Engine. If no lexicon is provided to the Tagger constructor, the standard English lexicon is used. To access the default lexicon, the semantic.hpp header file is used. |
1. includes+={
#include <semantic/semantic.hpp>
#include <semantic/utility.hpp>
}
|
In short, the tagger is invoked like so: semantic::tagger MyTagger(lexicon); The constructor has been designed to be simple. It will accept either a filestream containing the lexicon or the name and path to the file holding the data. If no argument is given, the default installation location is searched. |
2. Tagger Constructor={
tagger( std::istream& lexicon_stream )
{
init();
load_lexicon( lexicon_stream );
}
tagger( const std::string& lexicon_location="" )
{
init();
load_lexicon( lexicon_location );
}
}
|
Before doing any POS tagging, a rather messy set of tokenization operations need to be performed on the text. These involve removing punctuation from the word boundaries and modifying single and double quotes into a standardized 'directional' quote (i.e. from "word" to `` word '' and from 'term' to ` term '). Furthermore, any period marking the end of a sentence is separated from the text while periods that are a part of an abbreviation are kept joined to the term. For these operations, abbreviations.hpp is needed for distinguishing among the various uses of periods. The Boost string algorithm and cctype classes are also used for case conversion and case checking. |
3. includes+={
#include <semantic/abbreviations.hpp>
#include <boost/algorithm/string.hpp>
#include <cctype>
}
4. Tokenizer={
std::vector<std::string> tokenize(const std::string& text)
{
std::string s(text);
/* load the standard set of abbreviations */
Abbreviations abbrs;
Remove HTML
Tokenize on whitespace
Separate final periods
Separate words from punctuation
Separate other periods
return final;
}
}
|
Generally, any text that is not part of an ordinary sentence should not be passed in to the tagger (HTML, images, other markup, headers, page breaks, etc). But since HTML is so pervasive, the tokenizer first removes all HTML and HTML-like tags: anything with angle brackets, which (for me) is the most common form of markup to slip into the tagger. Any markup will cause the POS tagger to misbehave. |
5. Remove HTML={
// Remove any HTML-like tags that sneaked in
std::string::size_type htmlPos = s.find_first_of("<",0);
std::string::size_type lastHtmlPos = s.find_first_of(">",htmlPos);
while( htmlPos != std::string::npos && lastHtmlPos != std::string::npos ){
s.replace(htmlPos,lastHtmlPos-htmlPos+1," ");
htmlPos = s.find_first_of("<",htmlPos+1);
lastHtmlPos = s.find_first_of(">",htmlPos);
}
}
|
After removing any markup, the tagger does a first-round at tokenization, using only whitespace as a guide. The text is placed into an STL vector<string>. If, for whatever reason, the vector is empty, just short circuit the tagger. |
6. Tokenize on whitespace={
// Simple tokenization on whitespace
std::vector<std::string> coll;
std::string delim = "\x20\x09\x0a\x0c\x0d\xa0\x85";
std::string::size_type lastPos = s.find_first_not_of(delim,0);
std::string::size_type startPos = s.find_first_of(delim,lastPos);
while( startPos != std::string::npos || lastPos != std::string::npos ){
coll.push_back( s.substr(lastPos,startPos-lastPos) );
lastPos = s.find_first_not_of(delim,startPos);
startPos = s.find_first_of(delim,lastPos);
}
// short circuit...
if( coll.empty() )
return coll;
}
|
If the final token ends in a period or sequence of periods, separate them from the word. (This is the easiest case to deal with, so it is done separately.) |
7. Separate final periods={
// If the final token has a period, remove it
std::string elem = coll.back();
unsigned int n = 1;
std::string lchar = elem.substr(elem.length()-n,1);
if( lchar == "." ){
while(elem.length() >= n && elem.substr(elem.length()-n,1) == "."){
n++;
}
coll[coll.size()-1] = elem.substr(0,elem.size()-(n-1));
std::string final_period;
for( unsigned int j=1;j<n;j++){
final_period += ".";
}
coll.push_back(final_period);
}
}
|
Now that a vector of word tokens has been created, loop through each item, test whether there is any punctuation present, and if so, separate the word from the punctuation using clean_punctuation(). After cleaning, the token may have whitespace introduced as a way of identifying the new breakpoint. These points are used to divide the token further. A new vector<string> called tokens is created. |
8. Separate words from punctuation={
// Do the real tokenization now by separating off the punctuation
std::vector<std::string> tokens;
std::vector<std::string>::iterator pos;
for( pos = coll.begin(); pos != coll.end(); ++pos ){
std::string tok = *pos;
// only clean punctuation if there's something to clean
if( tok.find_first_not_of(letters+numbers,0) != std::string::npos ){
tok = clean_punctuation( tok );
std::string::size_type end = tok.find_first_of( " ", 0 );
if( end != std::string::npos ){
// There are multiple tokens here
tokens.push_back( tok.substr(0,end) );
while( end != std::string::npos ){
std::string::size_type beg = tok.find_first_not_of( " ", end );
end = tok.find_first_of( " ", beg );
tokens.push_back( tok.substr(beg,end-beg) );
}
} else {
// There's just one token here
tokens.push_back( tok );
}
} else {
tokens.push_back( tok );
}
}
}
|
Once all the punctuation is separated from the terms, the periods still remain to be processed. Here, a heuristic is used to identify whether the period marks the end of a sentence or the end of an abbreviation: 1. is the next word upper case or does it begin with a non-letter character? 2. is the token not listed among the list of common abbreviations in the abbreviations class? If either of these are true, the period is treated as the end of a sentence and separated from the word. |
9. Separate other periods={
/* Now deal with all the periods, most of which should be
* separated from the word (i.e. when it indicates the end
* of a sentence), but some should be left on the word
* (i.e. for abbreviations) */
std::vector<std::string> final;
std::vector<std::string>::size_type size = tokens.size();
for( unsigned i=0; i<size; ++i ){
std::string token = tokens[i];
std::string::size_type len = token.length();
if ( len > 1 &&
token.substr(len-1,1) == "." &&
i + 1 < size &&
( is_upper( tokens[i+1] ) || ! is_alpha( tokens[i+1] ) ) &&
!abbrs.is_abbreviation( token ) ) {
// Separate word and period
// Add both to tokens
final.push_back( token.substr(0,len-1) );
final.push_back( "." );
} else {
final.push_back( token );
}
}
}
|
Given a string of text, the tagger tokenizes the text according to the procedure above. The tagger then builds a finite state machine as it processes each item in the list of tokens. In other words, a hidden markov model of likelihoods for different bigram tag combinations is employed: using the tag from the previous word (initially set at 'pp' or end of sentence) and the likelihood that the given word corresponds to each of the available parts of speech, the POS tag is deterministically identified. The add_tags() method will process each token, first by cleaning the token with the clean_word() method. This mostly amounts to finding a version of the term in the word lexicon, possibly in a lower-case version, for instance. Numbers and symbols are reduced to simplified forms that can be understood by the lexicon, as are unknown words. |
10. Add POS Tags={
std::string add_tags( const std::string& text )
{
std::vector<std::string> toks = tokenize( text );
std::string tagged;
for( unsigned i = 0; i < toks.size(); ++i ){
std::string word = clean_word( toks[i] );
std::string tag = assign_tag( word );
tagged.append( toks[i] + "/" + tag + " " );
}
return tagged;
}
}
|
Unknown words are classified by word morphology and then passed in simplified form to the lexicon. For instance, 'preternaturally' doesn't occur in the lexicon, so it is classified as an '-ly' word and passed to the lexicon as "*LY*". |
11. Classify unknown word={
if( has_numbers ){
Classify number
}
if( has_hyphen && has_letters ){
Classify hyphenation
}
Classify other
}
|
Words that contain numbers are classified as either numeric values (*NUM*) or ordinal values (*ORD*), each of which has a different grammatical function. |
12. Classify number={
pos = word.find_first_not_of("-."+numbers);
if( pos == std::string::npos)
return "*NUM*";
if( has_letters ){
std::string::size_type m_pos = word.find_first_not_of(letters,pos);
if( m_pos == std::string::npos)
return "*ORD*";
}
pos = word.find_first_not_of("/:",pos);
if( pos != std::string::npos ){
pos = word.find_first_not_of(numbers,pos);
if( pos == std::string::npos)
return "*NUM*";
}
}
|
Hyphenated words are usually either adjectives or nouns. And if the second part of the hyphenation is, itself, a word, then the term is probably an adjectival hyphenated word. |
13. Classify hyphenation={
std::string::size_type lastPos = word.size()-2;
pos = word.find_last_of("-",lastPos);
if( pos > 0 &&
word.find_last_of(letters+numbers,pos) == pos-1 &&
word.find_first_of(letters+numbers,pos) == pos+1){
if( LEXICON.count( word.substr(pos+1,word.size()-1-pos) ) ){
return "*HYP-ADJ*";
} else {
return "*HYP*";
}
}
}
|
For all other terms, word morphology is used to help identify how to classify them. |
14. Classify other={
if( word.find_first_of("[({",0) != std::string::npos ){
return "*LRB*";
} else if ( word.find_first_of("]})",0) != std::string::npos ){
return "*RRB*";
} else if ( word.find_first_not_of(upper_case,0) != 0 &&
word.find_first_not_of(upper_case+".-&",1) == std::string::npos ){
return "*ABR*";
} else if ( word.find_first_of(letters+numbers,0) == std::string::npos ){
return "*SYM*";
} else if ( word == upper ){
return "*CAP*";
} else if ( word.size() >= 3 && word.substr(word.size()-3,3) == "ing" ){
return "*ING*";
} else if ( word.substr(word.size()-1,1) == "s" ){
return "*S*";
} else if ( word.size() >= 4 && word.substr(word.size()-4,4) == "tion" ){
return "*TION*";
} else if ( word.size() >= 2 && word.substr(word.size()-2,2) == "ly" ){
return "*LY*";
} else if ( word.size() >= 2 && word.substr(word.size()-2,2) == "ed" ){
return "*ED*";
} else {
return "*UNKNOWN*";
}
}
|
The assign_tag() method is a modified version of the Viterbi algorithm, used to identify the most likely tag for the given word. |
15. Assign tag={
std::string best_tag( "NN" );
unsigned long best_so_far = 0;
LexiconValueMap &wordtags = LEXICON[word];
LexiconValueMap &hmmtags = HMM[previous_tag];
for( LexiconValueMap::iterator pos = wordtags.begin();
pos != wordtags.end();
++pos ) {
std::string tag = pos->first;
if( hmmtags.count( tag ) ){
unsigned long probability = hmmtags[tag] * ( pos->second + 1 );
if( probability > best_so_far ){
best_so_far = probability;
best_tag = tag;
}
}
}
previous_tag = best_tag;
}
|
... To do ...
|
16. File: semantic/tagger.hpp={
#ifndef __SEMANTIC_TAGGER_HPP__
#define __SEMANTIC_TAGGER_HPP__
includes
// STL Includes
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <map>
#include <set>
namespace semantic {
class tagger {
public:
/* *********************************************************************
* Generic constructor
*********************************************************************** */
Tagger Constructor
/* *********************************************************************
* Separate words and punctuation in preparation for tagging
*********************************************************************** */
Tokenizer
/* **********************************************************************
* Add tags to the given std::string,
* returns a tagged std::string
* ********************************************************************** */
Add POS Tags
/* **********************************************************************
* Reset the internal HMM record of the previous POS tag
* ********************************************************************** */
void reset()
{
previous_tag = "PP";
}
/* **********************************************************************
* Get a std::map of the given POS type from a std::string of text
* ********************************************************************** */
std::map<std::string,int> get_POS( const std::string& text,
const std::set<std::string>& exps){
std::vector<std::string> toks = tokenize( text );
std::map<std::string,int> terms;
for( unsigned i = 0; i < toks.size(); ++i ){
std::string word = clean_word( toks[i] );
std::string tag = assign_tag( word );
if( exps.count(tag)){
terms[toks[i]]++;
}
}
return terms;
}
/* **********************************************************************
* Get the proper noun phrases from the given text string
* ********************************************************************** */
std::map<std::string,int> get_proper_nouns( const std::string& text ){
std::map<std::string,int> phrases = get_noun_phrases(text);
std::map<std::string,int>::iterator pos;
std::set<std::string> to_erase;
for( pos = phrases.begin(); pos != phrases.end(); ++pos){
std::pair<
std::vector<std::string>,
std::vector<std::string>
> phrase = vectorize_tagged_text(pos->first);
std::vector<std::string> words = phrase.first;
std::vector<std::string> tags = phrase.second;
// erase any noun phrases that have lower case nouns and adjectives
for( std::vector<std::string>::size_type i = words.size() - 1;
i >= 0; --i ){
std::string t = simplify_tag(tags[i]);
if( ( t == "N" || t == "M" ) && is_lower(words[i]) ){
to_erase.insert(pos->first);
i = 0; // short circuit
}
}
}
std::set<std::string>::iterator epos;
for( epos = to_erase.begin(); epos != to_erase.end(); ++epos ){
phrases.erase(*epos);
}
return phrases;
}
/* **********************************************************************
* Get the maximal noun phrases from the given text string
* ********************************************************************** */
std::map<std::string,int> get_maximal_noun_phrases( const std::string& text ){
std::vector<std::string> toks = tokenize( text );
std::map<std::string,int> terms;
std::string simplified;
std::vector<std::string> words;
std::vector<std::string> tags;
for( unsigned i = 0; i < toks.size(); ++i ){
std::string tag = assign_tag( clean_word( toks[i] ) );
words.push_back( toks[i] );
tags.push_back( tag );
simplified += simplify_tag( tag );
}
std::vector<std::pair<std::string::size_type,
std::string::size_type> > mnps;
try {
mnps = find_mnp(simplified);
} catch ( std::exception& e){
std::cerr << "Error finding Maximal Noun Phrase patterns: "
<< e.what() << std::endl;
}
std::vector<std::pair<std::string::size_type,
std::string::size_type> >::iterator pos;
for( pos = mnps.begin(); pos != mnps.end(); ++pos ){
std::string term;
std::string::size_type start = pos->first;
std::string::size_type mnp_length = pos->second;
for( std::string::size_type i = start;
i < start + mnp_length; ++i){
term += words[i] + "/" + tags[i];
if( i < start+mnp_length-1 ){
term += " ";
}
}
terms[term]++;
}
return terms;
}
/* **********************************************************************
* Get the noun phrases from the given text string
* ********************************************************************** */
std::map<std::string,int> get_noun_phrases( const std::string& text ){
// get the maximal noun phrases
std::map<std::string,int> terms;
try {
terms = get_maximal_noun_phrases(text);
} catch ( std::exception& e ){
std::cerr << "Error getting Maximal Noun Phrases: "
<< e.what() << std::endl;
}
std::map<std::string,int> subphrases;
std::map<std::string,int>::iterator pos;
// loop through the MNPs and extract subphrases
for( pos = terms.begin(); pos != terms.end(); ++pos ){
// first put the phrase string back into vectors
// and make a simplified version of the tag list
std::pair<std::vector<std::string>,
std::vector<std::string>
> words_and_tags = vectorize_tagged_text(pos->first);
std::vector<std::string> words = words_and_tags.first;
std::vector<std::string> tags = words_and_tags.second;
std::vector<std::string>::iterator vpos;
std::string tag_phrase;
for( vpos = tags.begin(); vpos != tags.end(); ++vpos){
tag_phrase += simplify_tag(*vpos);
}
// split up the noun phrase into subphrases by splitting
// on prepositions (P), articles (A), and numbers (D)
std::string::size_type ppos = tag_phrase.find_first_not_of("D",0);
std::string::size_type lastPpos = 0;
ppos = tag_phrase.find_first_of("PAD",ppos);
while( ppos != std::string::npos){
std::string subphrase = tag_phrase.substr(lastPpos,ppos-lastPpos);
// record resulting subphrase
std::string phrase;
for( unsigned i = 0; i < subphrase.size(); ++i ){
phrase += words[lastPpos+i] + "/" + tags[lastPpos+i];
if( i < subphrase.size() - 1){
phrase += " ";
}
}
subphrases[phrase]++;
// if the subphrase is longer than a single word, keep breaking
// it down by iteratively removing the leading word
if( subphrase.size() > 1){
for( unsigned spos = 0; spos < subphrase.size() -1; spos++){
// record first word, if it's a noun
if( subphrase.substr(spos,1) == "N" ){
std::string term = words[lastPpos+spos] +
"/" + tags[lastPpos+spos];
subphrases[term]++;
}
// record remainder of subphrase -- always a valid NP
std::string term;
for( unsigned i = 0; i < subphrase.size() - spos - 1; ++i ){
term += words[lastPpos+spos+i+1] +
"/" + tags[lastPpos+spos+i+1];
if( i < subphrase.size() - spos - 1){
term += " ";
}
}
subphrases[term]++;
}
}
ppos = tag_phrase.find_first_not_of("PAD",ppos);
lastPpos = ppos;
ppos = tag_phrase.find_first_of("PAD",ppos);
}
// record the last subphrase -- this may be the same as the
// tag_phrase if there were no prepositions or articles
std::string subphrase = tag_phrase.substr(lastPpos,
tag_phrase.size()-lastPpos);
if( subphrase.size() < tag_phrase.size() ){
// record if different than original phrase
std::string phrase;
for( unsigned i = 0; i < subphrase.size(); ++i ){
phrase += words[lastPpos+i] + "/" + tags[lastPpos+i];
if( i < subphrase.size() - 1){
phrase += " ";
}
}
subphrases[phrase]++;
}
if( subphrase.size() > 1){
for( unsigned spos = 0; spos < subphrase.size() -1; spos++){
// record first word, if it's a noun
if( subphrase.substr(spos,1) == "N" ){
std::string term = words[lastPpos+spos] +
"/" + tags[lastPpos+spos];
subphrases[term]++;
}
// record remainder of subphrase
std::string term;
for( unsigned i = 0; i < subphrase.size() - spos - 1; ++i ){
term += words[lastPpos+spos+i+1] +
"/" + tags[lastPpos+spos+i+1];
if( i < subphrase.size() - spos - 2){
term += " ";
}
}
subphrases[term]++;
}
}
}
// add all the found subphrases onto the termlist
for( pos = subphrases.begin(); pos != subphrases.end(); ++pos){
terms[pos->first] += pos->second;
}
return terms;
}
/* **********************************************************************
* Remove POS tags from the std::map of terms
* ********************************************************************** */
std::map<std::string,int> remove_tags( const std::map<std::string,int>& wordmap ){
std::map<std::string,int> terms;
std::map<std::string,int>::const_iterator pos;
for( pos = wordmap.begin(); pos != wordmap.end(); ++pos ){
std::string term(pos->first);
std::string::size_type tpos = term.find_last_of("/");
std::string::size_type lastPos = term.size();
while( tpos != std::string::npos ){
term.replace(tpos,lastPos-tpos,"");
tpos = term.find_last_of(" ",tpos-1);
lastPos = tpos;
tpos = term.find_last_of("/",tpos);
}
terms[term] += pos->second;
}
return terms;
}
/* **********************************************************************
* Get the nouns from the given text string
* ********************************************************************** */
std::map<std::string,int> get_nouns( const std::string& text ){
std::set<std::string> nouns;
nouns.insert("NN");
nouns.insert("NNP");
nouns.insert("NNS");
nouns.insert("NNPS");
return get_POS( text, nouns );
}
/* **********************************************************************
* Get the verbs from the given text string
* ********************************************************************** */
std::map<std::string,int> get_verbs( const std::string& text ){
std::set<std::string> verbs;
verbs.insert("VB");
verbs.insert("VBD");
verbs.insert("VBG");
verbs.insert("VBN");
verbs.insert("VBP");
verbs.insert("VBZ");
return get_POS( text, verbs );
}
/* **********************************************************************
* Get the adjectives from the given text string
* ********************************************************************** */
std::map<std::string,int> get_adjectives( const std::string& text ){
std::set<std::string> adjectives;
adjectives.insert("JJ");
adjectives.insert("JJS");
adjectives.insert("JJR");
return get_POS( text, adjectives );
}
/*%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#
%#
#% PRIVATE FUNCTIONS
%#
#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%#%*/
private:
std::string previous_tag;
std::string letters;
std::string upper_case;
std::string numbers;
typedef maps::ordered<std::string, unsigned long> LexiconValueMap;
typedef maps::unordered<std::string, LexiconValueMap> LexiconMap;
LexiconMap LEXICON;
LexiconMap HMM;
void init(){
reset();
numbers = "0123456789";
upper_case = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
letters = "abcdefghijklmnopqrstuvwxyz"+upper_case;
}
/* **********************************************************************
* Find MNPs within the given tagged text
* ********************************************************************** */
std::vector<std::pair<std::string::size_type,
std::string::size_type>
> find_mnp( const std::string tag_string ){
std::vector<std::pair<std::string::size_type,
std::string::size_type> > mnps;
std::string::size_type pos = tag_string.find_first_of("N",0);
while( pos != std::string::npos ){
std::string::size_type start = pos;
std::string::size_type end;
if( pos > 0 ){
std::string::size_type temp = tag_string.find_last_not_of("M",pos-1);
if( temp != std::string::npos && tag_string.substr(temp,1) == "D"){
start = temp;
} else {
start = temp + 1;
}
}
pos = tag_string.find_first_not_of("N",pos);
if( pos == std::string::npos ){
end = tag_string.size()-1;
} else {
end = pos - 1;
}
bool keep_going = true;
while( pos != std::string::npos &&
tag_string.substr(pos,1) != "X" &&
keep_going ){
pos = tag_string.find_first_not_of("P",pos);
std::string::size_type n = tag_string.find_first_not_of("A",pos) - pos;
if( n == 1 || n == 0 ){ // proceed
pos += n;
n = tag_string.find_first_not_of("D",pos) - pos;
if( n == 1 || n == 0 ){ // proceed
pos += n;
pos = tag_string.find_first_not_of("M",pos);
if( pos != std::string::npos &&
tag_string.substr(pos,1) == "N"){ // found Noun head
// go to end of N-head
pos = tag_string.find_first_not_of("N",pos);
if( pos == std::string::npos ){
end = tag_string.size()-1;
} else {
end = pos - 1;
}
} else {
keep_going = false;
}
} else {
keep_going = false;
}
} else {
keep_going = false;
}
}
mnps.push_back( std::make_pair(start,end-start+1));
pos = tag_string.find_first_of("N",pos);
}
return mnps;
}
/* **********************************************************************
* Used by `find_mnp`
* ********************************************************************** */
std::string simplify_tag( const std::string t ){
std::string s;
if( t.substr(0,2) == "NN"){
// noun
return "N";
} else if( t.substr(0,2) == "JJ" || t == "VBN" || t == "VBG"){
// modifier
return "M";
} else if( t == "IN" || t == "TO"){
// preposition
return "P";
} else if( t == "DT"){
// article
return "A";
} else if( t == "CD" ){
// number
return "D";
} else {
// some other tag
return "X";
}
}
/* **********************************************************************
* Used by `tokenize`
* ********************************************************************** */
std::string clean_punctuation( const std::string& s )
{
std::string tok(s);
std::string::size_type pos;
try {
pos = tok.find_first_of("([{}])!?#$;~|%",0);
while( pos != std::string::npos ){
tok.insert(pos+1," ");
tok.insert(pos," ");
pos = tok.find_first_of("([{}])!?#$;~|%", pos+3);
}
} catch ( std::exception &e ) {
std::cerr << "Parsing Error (1): " << e.what() << std::endl;
}
try {
pos = tok.find("--",0);
while( pos != std::string::npos ){
while( pos+1 != std::string::npos && tok.substr(pos+1,1) == "-"){
tok.replace(pos+1,1,"");
}
tok.insert(pos+1," ");
tok.insert(pos," ");
pos = tok.find("--",pos+3);
}
} catch ( std::exception &e ) {
std::cerr << "Parsing Error (2): " << e.what() << std::endl;
}
try {
pos = tok.find("...",0);
while( pos != std::string::npos ){
int length = 3;
while( pos+length != std::string::npos && tok.substr(pos+length,1) == "."){
length++;
}
tok.replace(pos,length,"...");
tok.insert(pos+3," ");
tok.insert(pos," ");
pos = tok.find("...",pos+3);
}
} catch ( std::exception &e ) {
std::cerr << "Parsing Error (3): " << e.what() << std::endl;
}
try {
pos = tok.find_first_of( "`", 0);
while( pos != std::string::npos ){
if( tok.substr(pos+1,1) == "`"){
tok.insert(pos+2," ");
} else {
tok.insert(pos+1," ");
}
tok.insert(pos," ");
pos = tok.find_first_of("`",pos+3);
}
} catch ( std::exception &e ) {
std::cerr << "Parsing Error (4): " << e.what() << std::endl;
}
try {
pos = tok.find( "''", 0);
while( pos != std::string::npos ){
if( pos+2 != std::string::npos ){
tok.insert(pos+2," ");
}
tok.insert(pos," ");
pos = tok.find("''",pos+2);
}
} catch ( std::exception &e ) {
std::cerr << "Parsing Error (5): " << e.what() << std::endl;
}
try {
pos = tok.find_first_of(",",0);
while( pos != std::string::npos ){
std::string::size_type m_pos = tok.find_first_of(numbers,pos);
if( pos+1 != std::string::npos && m_pos == pos+1){
pos = tok.find_first_of(",",pos+1);
} else {
tok.insert(pos+1," ");
tok.insert(pos," ");
pos = tok.find_first_of(",", pos+3);
}
}
} catch ( std::exception &e ) {
std::cerr << "Parsing Error (6): " << e.what() << std::endl;
}
try {
pos = tok.find_first_of("\"",0);
while( pos != std::string::npos){
if( tok.find_first_of(letters,pos+1) != std::string::npos){
tok.replace(pos,1," `` ");
} else {
tok.replace(pos,1," '' ");
}
pos = tok.find_first_of("\"", pos+4 );
}
} catch ( std::exception &e ) {
std::cerr << "Parsing Error (7): " << e.what() << std::endl;
}
if( tok.substr(tok.size()-1,1) == ":"){
tok.insert(tok.size()-1," ");
}
if( tok.size() > 1 && tok.substr(tok.size()-1,1) == "."){
std::string::size_type m_pos = tok.find_last_not_of(letters,tok.size()-2);
if( m_pos < tok.size() - 2){
tok.insert(tok.size()-1," ");
}
}
try {
pos = tok.find_first_of("'",0);
while( pos != std::string::npos){
std::string next_char;
std::string prev_char;
bool prev_is_alphaNum = false;
bool next_is_alphaNum = false;
if( pos < tok.size()-1){
next_char = tok.substr(pos+1,1);
if( next_char.find_first_of(letters+numbers,0) != std::string::npos)
next_is_alphaNum=true;
}
if( pos > 0){
prev_char = tok.substr(pos-1,1);
if( prev_char.find_first_of(letters+numbers,0) != std::string::npos)
prev_is_alphaNum=true;
}
if( prev_char == "."){
tok.insert(pos," ");
pos = tok.find_first_of("'",pos+2);
} else if( pos < tok.size()-1 && tok.substr(pos+1,1) == "'") {
// next char is a single quote (')
pos = tok.find_first_of("'",pos+2);
} else if( next_is_alphaNum && prev_is_alphaNum ){
// next and prev char is alpha-numeric
// is it a contraction?
if( prev_char == "n" && next_char == "t" ){
tok.insert(pos-1," ");
pos = tok.find_first_of("'",pos+3);
} else if ( next_char == "d" || next_char == "m" || next_char == "s" ){
if( pos == tok.size()-2 ||
tok.find_first_not_of(letters+numbers,pos+2) == pos+2 ){
tok.insert(pos," ");
pos = tok.find_first_of("'",pos+3);
} else {
pos = tok.find_first_of("'",pos+2);
}
} else if ( tok.find("'ve", pos)==pos ||
tok.find("'ll",pos)==pos ||
tok.find("'re",pos)==pos){
tok.insert(pos," ");
pos = tok.find_first_of("'",pos+4);
} else {
pos = tok.find_first_of("'",pos+1);
}
} else if ( next_is_alphaNum && !prev_is_alphaNum ){ // begins word
tok.replace(pos,1," ` ");
pos = tok.find_first_of("'",pos+3);
} else if ( !next_is_alphaNum && prev_is_alphaNum ){ // ends word
tok.replace(pos,1," ' ");
pos = tok.find_first_of("'",pos+3);
} else {
pos = tok.find_first_of("'",pos+1);
}
}
} catch ( std::exception &e ) {
std::cerr << "Parsing Error (8): " << e.what() << std::endl;
}
boost::algorithm::trim(tok);
return tok;
}
/* **********************************************************************
* Shortcut for taking a tagged string and turning it
* back into two vectors of words and tags
* ********************************************************************** */
std::pair<std::vector<std::string>,
std::vector<std::string>
> vectorize_tagged_text(const std::string tagged){
std::vector<std::string> words;
std::vector<std::string> tags;
std::string::size_type word_start = 0;
std::string::size_type tag_end = tagged.find_first_of(" ",0);
std::string::size_type divider;
while( tag_end != std::string::npos ){
divider = tagged.find_last_of("/",tag_end);
words.push_back(tagged.substr(word_start,divider-word_start));
tags.push_back(tagged.substr(divider+1,tag_end-divider-1));
word_start = tagged.find_first_not_of(" ",tag_end);
tag_end = tagged.find_first_of(" ",word_start);
}
divider = tagged.find_last_of("/",tagged.size());
words.push_back(tagged.substr(word_start,divider-word_start));
tags.push_back(tagged.substr(divider+1,tagged.size()-divider-1));
return std::make_pair(words,tags);
}
/* *********************************************************************
* Load the lexicon into a 2-dimensional std::map
*********************************************************************** */
void load_lexicon( const std::string& filename="" )
{
if (!filename.empty()) {