Chapter 2. Usage/Quick Start

Table of Contents

Getting Started
Indexing Some Files
Performing a Basic Search
Going Further

Getting Started

The Semantic Engine is a header-only library, so there is nothing to compile; everything is inlined. To use it, simply #include the headers you need and start calling the library.

Indexing Some Files

In this section we will write a short program to index some files. These files will be textual in nature, and the index will be stored in an SQLite 3 database.

Procedure 1.1.  Example Program: Index Files

  1. First, we must include the necessary Semantic Engine files to do our indexing.

    #include <semantic/semantic.hpp>
    #include <semantic/indexing.hpp>
    #include <semantic/storage/sqlite3.hpp>
                        
  2. The next step is to construct a graph object, name the collection and point it to an SQLite 3 database file (which need not exist yet).

    We will omit the int main() function for brevity.

    typedef semantic::SEGraph<semantic::SQLite3StoragePolicy> MyGraph;
    
    MyGraph graph("test collection");
    graph.set_file("my_index.db");
                            
                        
  3. Next we will create the text indexing helper object, which lets us easily parse a block of text, apply some natural language processing to it, stem the words, and construct the graph. By default, all of this happens automatically.

    typedef semantic::text_indexer<MyGraph> MyIndexer;
    
    MyIndexer indexer(graph, "/usr/local/share/semantic/tagger/lexicon.txt.gz");
                        

    The lexicon.txt.gz file should have been installed with the Semantic Engine library. You may supply your own lexicon (in another language, for example) or use the convenient LEXICON_INSTALL_LOCATION, which is defined in semantic/config.hpp.

  4. Now comes the fun part. Let's index some files! Our loop here iterates through an imaginary std::vector of strings pointing to text files on our hard drive.

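    // start tracking changes so they can be written to the database later
    graph.set_mirror_changes_to_storage(true);
    
    for(std::vector<std::string>::iterator it = my_filenames.begin();
                it != my_filenames.end(); ++it) {
        indexer.index(*it);
    }
                            
    // stop tracking, then write the tracked changes to disk
    graph.set_mirror_changes_to_storage(false);
    graph.commit_changes_to_storage();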
                            
                        

    You might be wondering what set_mirror_changes_to_storage() and commit_changes_to_storage() do. The SEGraph object holds the index in memory, and there are times when you perform analysis on the graph and do not want those operations reflected in your on-disk index. For that reason, mirroring is off (false) by default. Once you set it to true, the graph begins keeping track of the changes you make. When commit_changes_to_storage() is called, those changes are written to disk. Depending on the amount of text indexed, this process can take a while (as a general rule, about 5 minutes per 15,000 paragraphs of text).
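
To make the mirror-then-commit pattern concrete, here is a minimal sketch of the idea. ToyGraph and all of its members are hypothetical stand-ins written for this illustration; they are not the SEGraph implementation, and "disk" is simply a second std::vector.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy class illustrating change tracking; NOT the SEGraph API.
class ToyGraph {
public:
    ToyGraph() : mirror_(false) {}

    // While mirroring is on, every change is also queued for storage.
    void set_mirror_changes_to_storage(bool on) { mirror_ = on; }

    // Always change the in-memory graph; queue the change only when mirroring.
    void add_node(const std::string& label) {
        in_memory_.push_back(label);
        if (mirror_) pending_.push_back(label);
    }

    // Write the queued changes to "disk", clear the queue,
    // and report how many changes were written.
    std::size_t commit_changes_to_storage() {
        std::size_t written = pending_.size();
        on_disk_.insert(on_disk_.end(), pending_.begin(), pending_.end());
        pending_.clear();
        return written;
    }

    std::size_t in_memory_size() const { return in_memory_.size(); }
    std::size_t on_disk_size() const { return on_disk_.size(); }

private:
    bool mirror_;                         // off by default, as in the text
    std::vector<std::string> in_memory_;  // the working graph
    std::vector<std::string> pending_;    // changes made while mirroring
    std::vector<std::string> on_disk_;    // stand-in for the database file
};
```

In this toy, a node added before mirroring is switched on stays in memory only, and nothing reaches the on-disk vector until commit_changes_to_storage() is called.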

Performing a Basic Search

Now let's perform a basic search on our index. This requires the SESubgraph class instead of SEGraph (although in a sense we are still using SEGraph, since it is a superclass of SESubgraph). A search in the Semantic Engine is nothing more than extracting a subgraph of the main graph. Since indices grow large very quickly, working with subgraphs is a big time-saver.
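
To illustrate what "extracting a subgraph" means, the sketch below collects every node within a fixed number of hops of a starting node. It uses a plain breadth-first search as a simple stand-in; it is not the pruning random walk policy used later in this chapter, and ToyAdjacency and extract_subgraph() are hypothetical names invented for this example.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical adjacency list: node label -> labels of neighbouring nodes.
typedef std::map<std::string, std::vector<std::string> > ToyAdjacency;

// Return the labels of all nodes reachable from `seed` in at most
// `max_depth` hops. The seed must be a key of `graph`; otherwise the
// result is empty.
std::set<std::string> extract_subgraph(const ToyAdjacency& graph,
                                       const std::string& seed,
                                       unsigned max_depth) {
    std::set<std::string> visited;
    if (graph.find(seed) == graph.end()) return visited;

    std::queue<std::pair<std::string, unsigned> > frontier;
    visited.insert(seed);
    frontier.push(std::make_pair(seed, 0u));

    while (!frontier.empty()) {
        std::pair<std::string, unsigned> current = frontier.front();
        frontier.pop();
        if (current.second == max_depth) continue;  // do not expand further

        ToyAdjacency::const_iterator it = graph.find(current.first);
        if (it == graph.end()) continue;            // node has no out-edges

        for (std::size_t i = 0; i < it->second.size(); ++i) {
            // insert() reports whether the neighbour was newly added
            if (visited.insert(it->second[i]).second) {
                frontier.push(std::make_pair(it->second[i],
                                             current.second + 1));
            }
        }
    }
    return visited;
}
```

The point of the sketch is only that the result is much smaller than the full graph, which is why later operations on it are cheaper.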

Procedure 1.2. Example Program: Basic Search

  1. As in our previous example, let's start with the includes.

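    // standard headers used by the code in this procedure
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>
    
    // boost::tie() is used to unpack the search results
    #include <boost/tuple/tuple.hpp>
    
    #include <semantic/subgraph.hpp>
    #include <semantic/search.hpp>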
    
    // as per normal, we need a storage policy -- this will allow us access to our index
    #include <semantic/storage/sqlite3.hpp>
                            
    // subgraphs are built by using a subgraph policy
    #include <semantic/subgraph/pruning_random_walk.hpp>
                            
    // and a weighting policy is necessary for some subgraph policies, as well
    // as for ranking operations
    #include <semantic/weighting/lg.hpp>
    #include <semantic/weighting/tf.hpp>
    #include <semantic/weighting/idf.hpp>
                            
                        

    We will combine the three weighting policy files into a single weighting policy. lg.hpp defines Local/Global weighting, tf.hpp defines Term Frequency, and idf.hpp defines Inverse Document Frequency. Combined, they give us TF/IDF weighting.

  2. Now we move on to creating a new SESubgraph object and performing a search. The search starts from a textual query and populates the subgraph with nodes and edges.

    Again, we will omit the int main() function.

    typedef semantic::SESubgraph<
        semantic::SQLite3StoragePolicy, 
        semantic::PruningRandomWalkSubgraph, 
        semantic::LGWeighting<semantic::TFWeighting, semantic::IDFWeighting>
        > MySubgraph;
    typedef semantic::search<MySubgraph> MySearchHelper;
                            
    MySubgraph graph("test collection");
    graph.set_file("my_index.db");
    
    MySearchHelper search_helper(graph);
    
    std::vector<std::pair<std::string, double> >
        my_documents, my_terms;
    
    boost::tie(my_documents, my_terms) = search_helper.semantic("my query string");
                                    
                        

    In case you are wondering, boost::tie() assigns the first element of a std::pair to its first argument and the second element to its second argument. search_helper.semantic() returns a std::pair of the two std::vectors declared above. The results returned by our helper will be sorted.

  3. Now, let's just print out our results and we're done!

    std::cout << "Top 10 terms: ";
    for(unsigned int i = 0; i < 10 && i < my_terms.size(); i++)
        std::cout << my_terms[i].first << " ";
    std::cout << std::endl;
    std::cout << "Results:" << std::endl;
    for(unsigned int i = 0; i < my_documents.size(); i++)
        std::cout << "Rank: " << my_documents[i].second << ", Document: " << my_documents[i].first << std::endl;
                            
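Step 1 of the procedure above combines three weighting policies into one. The sketch below shows how a local policy and a global policy can be composed through a template to yield TF/IDF. ToyTF, ToyIDF and ToyLG are hypothetical names invented here, and the formulas (raw term count multiplied by log(N / df)) are the textbook definitions; they are assumed for illustration and are not taken from the Semantic Engine headers.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical local policy: raw count of a term within one document.
struct ToyTF {
    static double weight(unsigned term_count_in_doc) {
        return static_cast<double>(term_count_in_doc);
    }
};

// Hypothetical global policy: log(total documents / documents with the term).
struct ToyIDF {
    static double weight(unsigned total_docs, unsigned docs_with_term) {
        assert(docs_with_term > 0 && docs_with_term <= total_docs);
        return std::log(static_cast<double>(total_docs) / docs_with_term);
    }
};

// Hypothetical combiner: multiplies a local weight by a global weight.
template <class Local, class Global>
struct ToyLG {
    static double weight(unsigned term_count_in_doc,
                         unsigned total_docs,
                         unsigned docs_with_term) {
        return Local::weight(term_count_in_doc) *
               Global::weight(total_docs, docs_with_term);
    }
};

// Plugging TF and IDF into the combiner gives a TF/IDF weighting policy.
typedef ToyLG<ToyTF, ToyIDF> ToyTFIDF;
```

Under these assumed formulas, a term that appears in every document receives a weight of zero, while a term concentrated in a few documents is weighted more heavily.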
                        
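Step 2 of the procedure relies on unpacking a std::pair into two existing variables. The sketch below shows the same idiom using std::tie from the C++11 standard library, swapped in for boost::tie() so that the example compiles without Boost. toy_search() and ToyResults are hypothetical stand-ins for search_helper.semantic() and its result type, and the values are made up.

```cpp
#include <string>
#include <tuple>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, double> > ToyResults;

// Hypothetical stand-in for a search helper: returns (documents, terms).
std::pair<ToyResults, ToyResults> toy_search() {
    ToyResults documents, terms;
    documents.push_back(std::make_pair(std::string("a.txt"), 0.9));
    documents.push_back(std::make_pair(std::string("b.txt"), 0.4));
    terms.push_back(std::make_pair(std::string("graph"), 0.7));
    return std::make_pair(documents, terms);
}

// Assign the first element of the returned pair to `documents`
// and the second element to `terms`, in a single statement.
void unpack(ToyResults& documents, ToyResults& terms) {
    std::tie(documents, terms) = toy_search();
}
```

The tie call builds a temporary tuple of references to the two variables, so assigning the pair to it writes straight into them.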

Going Further

The Semantic Engine is capable of far more than mere indexing and searching. Further information can be found in ??? and ???.