The Semantic Engine is a header-only library. There is nothing
to compile separately because everything is inlined. To use the library, simply
#include the headers you need.
In this section we will write a short program that indexes some text files and stores the index in an SQLite 3 database.
Procedure 1.1. Example Program: Index Files
First, we must include the necessary Semantic Engine files to do our indexing.
#include <semantic/semantic.hpp>
#include <semantic/indexing.hpp>
#include <semantic/storage/sqlite3.hpp>
The next step is to construct a graph object, name the collection and point it to an SQLite 3 database file (which need not exist yet).
We will omit the int main() function for brevity.
typedef semantic::SEGraph<semantic::SQLite3StoragePolicy> MyGraph;
MyGraph graph("test collection");
graph.set_file("my_index.db");
Next we will create the text indexing helper object, which will allow us to easily parse a block of text, apply some natural language processing to it, stem words and construct the graph. By default, all this happens automatically.
typedef semantic::text_indexer<MyGraph> MyIndexer;
MyIndexer indexer(graph, "/usr/local/share/semantic/tagger/lexicon.txt.gz");
The lexicon.txt.gz file should have been installed with the
Semantic Engine library. You may supply your own
lexicon (in another language, for example) or use the convenient
LEXICON_INSTALL_LOCATION which is defined in semantic/config.hpp.
Now comes the fun part. Let's index some files! Our loop here iterates through
an imaginary std::vector of strings pointing to text files on our
hard drive.
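// Start tracking changes so they can be written to the on-disk index.
graph.set_mirror_changes_to_storage(true);
for(std::vector<std::string>::iterator it = my_filenames.begin();
    it != my_filenames.end(); ++it) {
    indexer.index(*it);
}
// Stop tracking, then write the tracked changes to disk.
graph.set_mirror_changes_to_storage(false);
graph.commit_changes_to_storage();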
You might be asking what set_mirror_changes_to_storage()
and commit_changes_to_storage() do. The SEGraph
object holds the index in memory, and sometimes you want to analyze or modify the graph
without those operations being reflected in your on-disk index. For that reason, mirroring is
off (false) by default. Once you call set_mirror_changes_to_storage(true), the graph
begins keeping track of the changes you make. When
commit_changes_to_storage() is called, the tracked changes are written to disk. Depending
on the amount of text indexed, this can take a while (as a general rule, about 5 minutes
per 15,000 paragraphs of text).
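To make the track-then-commit idea concrete, here is a minimal, self-contained sketch. ToyGraph and all of its members are hypothetical stand-ins written for illustration; this is not how SEGraph is implemented, only one way the described behaviour could work.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy class, NOT the SEGraph implementation.
// It keeps every change in memory and queues a change for storage
// only while mirroring is switched on.
class ToyGraph {
public:
    ToyGraph() : mirror_(false) {}  // mirroring is off by default

    void set_mirror_changes_to_storage(bool on) { mirror_ = on; }

    // Always updates the in-memory state; queues the change for
    // storage only if mirroring is currently enabled.
    void add_node(const std::string& label) {
        memory_.push_back(label);
        if (mirror_) pending_.push_back(label);
    }

    // Writes the queued changes to "disk" (here just another vector)
    // and empties the queue.
    void commit_changes_to_storage() {
        disk_.insert(disk_.end(), pending_.begin(), pending_.end());
        pending_.clear();
    }

    std::size_t in_memory() const { return memory_.size(); }
    std::size_t pending() const { return pending_.size(); }
    std::size_t on_disk() const { return disk_.size(); }

private:
    bool mirror_;
    std::vector<std::string> memory_, pending_, disk_;
};
```

In this sketch, a node added while mirroring is off exists only in memory, and nodes added while mirroring is on reach the "disk" vector only after commit_changes_to_storage() is called.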
Now let's perform a basic search on our index. This requires the SESubgraph
class instead of SEGraph (although, since SEGraph is a superclass of SESubgraph,
we are in a sense still using SEGraph). A search in the Semantic Engine is nothing
more than extracting a subgraph of the main graph. Since indices grow large quickly,
working with subgraphs is a big time-saver.
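The idea of "search as subgraph extraction" can be sketched without the library. The example below is an illustration only: it uses a plain breadth-first walk limited to a fixed number of hops, which is a simpler technique than the pruning random walk policy used later in this section. The Adjacency type and extract_subgraph() are hypothetical names, not part of the Semantic Engine.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical graph representation: node label -> neighbour labels.
typedef std::map<std::string, std::vector<std::string> > Adjacency;

// Collect every node reachable from `seed` in at most `max_depth` hops.
// This is a breadth-first sketch, not the library's subgraph policy.
inline std::set<std::string> extract_subgraph(const Adjacency& graph,
                                              const std::string& seed,
                                              std::size_t max_depth) {
    std::set<std::string> visited;
    std::queue<std::pair<std::string, std::size_t> > frontier;
    visited.insert(seed);
    frontier.push(std::make_pair(seed, std::size_t(0)));
    while (!frontier.empty()) {
        std::pair<std::string, std::size_t> current = frontier.front();
        frontier.pop();
        if (current.second == max_depth) continue;  // do not expand further
        Adjacency::const_iterator it = graph.find(current.first);
        if (it == graph.end()) continue;            // node has no out-edges
        for (std::size_t i = 0; i < it->second.size(); ++i) {
            // insert() reports whether the neighbour was newly added
            if (visited.insert(it->second[i]).second)
                frontier.push(std::make_pair(it->second[i],
                                             current.second + 1));
        }
    }
    return visited;
}
```

The returned set is typically much smaller than the full graph, which is why later operations such as ranking are cheaper on a subgraph.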
Procedure 1.2. Example Program: Basic Search
As in our previous example, let's start with the includes.
#include <semantic/subgraph.hpp>
#include <semantic/search.hpp>
// as per normal, we need a storage policy -- this will allow us access to our index
#include <semantic/storage/sqlite3.hpp>
// subgraphs are built by using a subgraph policy
#include <semantic/subgraph/pruning_random_walk.hpp>
// and a weighting policy is necessary for some subgraph policies, as well
// as for ranking operations
#include <semantic/weighting/lg.hpp>
#include <semantic/weighting/tf.hpp>
#include <semantic/weighting/idf.hpp>
We will combine the three weighting policy files into a single weighting policy.
lg.hpp defines Local/Global weighting, tf.hpp
defines Term Frequency and idf.hpp defines
Inverse Document Frequency. Using Term Frequency as the local weight and Inverse Document Frequency as the global weight gives us TF/IDF weighting.
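For readers unfamiliar with TF/IDF, the following sketch shows the textbook form of the calculation: a local weight (how often a term occurs in one document) multiplied by a global weight (the logarithm of the total document count over the number of documents containing the term). The function names are hypothetical, and the exact formulas used by the library's weighting policies may differ.

```cpp
#include <cmath>
#include <cstddef>

// Local weight: raw count of the term within a single document.
inline double tf_weight(std::size_t term_count_in_doc) {
    return static_cast<double>(term_count_in_doc);
}

// Global weight: log(N / df), where N is the number of documents in the
// collection and df is the number of documents containing the term.
// Returns 0 for a term that appears in no document.
inline double idf_weight(std::size_t total_docs, std::size_t docs_with_term) {
    if (docs_with_term == 0) return 0.0;
    return std::log(static_cast<double>(total_docs) /
                    static_cast<double>(docs_with_term));
}

// Combined local/global weight (textbook TF/IDF).
inline double tf_idf(std::size_t term_count_in_doc,
                     std::size_t total_docs,
                     std::size_t docs_with_term) {
    return tf_weight(term_count_in_doc) *
           idf_weight(total_docs, docs_with_term);
}
```

With this definition, a term that occurs in every document gets a weight of zero, while a term that is frequent in one document but rare across the collection gets a high weight.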
Next we create a new SESubgraph object and perform a search. The search starts
from a textual query and populates the subgraph with nodes and edges.
Again, we will omit the int main() function.
typedef semantic::SESubgraph<
semantic::SQLite3StoragePolicy,
semantic::PruningRandomWalkSubgraph,
semantic::LGWeighting<semantic::TFWeighting, semantic::IDFWeighting>
> MySubgraph;
typedef semantic::search<MySubgraph> MySearchHelper;
MySubgraph graph("test collection");
graph.set_file("my_index.db");
MySearchHelper search_helper(graph);
std::vector<std::pair<std::string, double> >
my_documents, my_terms;
boost::tie(my_documents, my_terms) = search_helper.semantic("my query string");
In case you are wondering, boost::tie() (declared in boost/tuple/tuple.hpp) assigns the first element of a std::pair to its first argument
and the second element to its second argument.
search_helper.semantic() returns a std::pair of std::vectors of the type declared above,
so my_documents and my_terms are both filled in one statement. The results returned by our helper will be sorted.
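The unpacking pattern can be tried in isolation. The sketch below uses std::tie from the standard tuple header (C++11) in place of boost::tie, since the two behave the same way for this purpose. fake_search() and run_search() are hypothetical stand-ins for the search helper and return made-up data purely for illustration.

```cpp
#include <string>
#include <tuple>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, double> > Ranked;

// Hypothetical stand-in for a search call that returns a pair of
// (documents, terms). The values are invented for the example.
inline std::pair<Ranked, Ranked> fake_search() {
    Ranked docs, terms;
    docs.push_back(std::make_pair(std::string("a.txt"), 0.9));
    docs.push_back(std::make_pair(std::string("b.txt"), 0.4));
    terms.push_back(std::make_pair(std::string("index"), 0.7));
    return std::make_pair(docs, terms);
}

// std::tie builds a tuple of references; assigning a std::pair to it
// copies .first into my_documents and .second into my_terms.
inline void run_search(Ranked& my_documents, Ranked& my_terms) {
    std::tie(my_documents, my_terms) = fake_search();
}
```

The benefit over keeping the returned pair is readability: the caller works with two named vectors rather than with .first and .second.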
Now, let's just print out our results and we're done!
std::cout << "Top 10 terms: ";
for(unsigned int i = 0; i < 10 && i < my_terms.size(); i++)
    std::cout << my_terms[i].first << " ";
std::cout << std::endl;
std::cout << "Results:" << std::endl;
for(unsigned int i = 0; i < my_documents.size(); i++)
    std::cout << "Rank: " << my_documents[i].second
              << ", Document: " << my_documents[i].first << std::endl;