mlpack  master
PSpectrumStringKernel Class Reference

The p-spectrum string kernel.

Public Member Functions

 PSpectrumStringKernel (const std::vector< std::vector< std::string > > &datasets, const size_t p)
 Initialize the PSpectrumStringKernel with the given string datasets.

 
const std::vector< std::vector< std::map< std::string, int > > > & Counts () const
 Access the lists of substrings.

 
std::vector< std::vector< std::map< std::string, int > > > & Counts ()
 Modify the lists of substrings.

 
template<typename VecType>
double Evaluate (const VecType &a, const VecType &b) const
 Evaluate the kernel for the string indices given.

 
size_t P () const
 Access the value of p.

 
size_t & P ()
 Modify the value of p.

 

Detailed Description

The p-spectrum string kernel.

Given a length p, the p-spectrum kernel counts the matching contiguous substrings of length p between two strings. The kernel takes every substring of length p in one string, counts how many times it appears in the other string, and sums those counts.
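The definition above can be illustrated with a minimal standalone sketch that uses only the C++ standard library. The function names SpectrumCounts and PSpectrumKernel are hypothetical and are not part of mlpack; the real class may also differ in details such as how it treats particular characters.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not mlpack API): count every contiguous substring of
// length p that occurs in s.
std::map<std::string, int> SpectrumCounts(const std::string& s,
                                          const std::size_t p)
{
  std::map<std::string, int> counts;
  if (p == 0 || s.size() < p)
    return counts;
  for (std::size_t i = 0; i + p <= s.size(); ++i)
    ++counts[s.substr(i, p)];
  return counts;
}

// Hypothetical helper (not mlpack API): for every substring of length p in a,
// add (occurrences in a) * (occurrences in b) to the result.
double PSpectrumKernel(const std::string& a,
                       const std::string& b,
                       const std::size_t p)
{
  const std::map<std::string, int> ca = SpectrumCounts(a, p);
  const std::map<std::string, int> cb = SpectrumCounts(b, p);
  double result = 0.0;
  for (const auto& entry : ca)
  {
    const auto match = cb.find(entry.first);
    if (match != cb.end())
      result += static_cast<double>(entry.second) * match->second;
  }
  return result;
}
```

For example, with p = 2 the strings "abab" and "ab" share only the substring "ab", which occurs twice in the first string and once in the second, so the sketch returns 2.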

The string kernel, when created, must be passed a reference to a series of string datasets (std::vector<std::vector<std::string> >&). This is because mlpack only supports datasets which are Armadillo matrices – and a dataset of variable-length strings cannot be easily cast into an Armadillo matrix.

Therefore, once the PSpectrumStringKernel is created with a reference to the string datasets, a "fake" Armadillo data matrix must be created, whose columns simply hold indices that identify the strings they represent. This "fake" matrix has two rows and n columns (where n is the number of strings in the dataset). The first row holds the index of the dataset (remember, the kernel can have multiple datasets), and the second row holds the index of the string within that dataset. A fake matrix containing only strings from dataset 0 might look like this:

[[0 0 0 0 0 0 0 0 0]
 [0 1 2 3 4 5 6 7 8]]

This fake matrix is then given to the machine learning method, which will eventually call PSpectrumStringKernel::Evaluate(a, b), where a and b are two columns of the fake matrix. The string kernel will then map these fake columns back to the strings they represent, and then correctly evaluate the kernel.
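The index-matrix idea can be sketched without Armadillo as follows. This is only an illustration under stated assumptions: a std::vector of two rows stands in for the Armadillo matrix, and MakeIndexMatrix and Resolve are hypothetical names, not mlpack functions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using StringDatasets = std::vector<std::vector<std::string>>;

// Hypothetical helper: build a 2 x n table for one dataset. Row 0 holds the
// dataset index and row 1 the string index, one column per string.
std::vector<std::vector<double>> MakeIndexMatrix(
    const StringDatasets& datasets,
    const std::size_t datasetIndex)
{
  const std::size_t n = datasets.at(datasetIndex).size();
  std::vector<std::vector<double>> fake(2, std::vector<double>(n));
  for (std::size_t j = 0; j < n; ++j)
  {
    fake[0][j] = static_cast<double>(datasetIndex);
    fake[1][j] = static_cast<double>(j);
  }
  return fake;
}

// Hypothetical helper: map column j of the index table back to the string
// it stands for.
const std::string& Resolve(const StringDatasets& datasets,
                           const std::vector<std::vector<double>>& fake,
                           const std::size_t j)
{
  const std::size_t d = static_cast<std::size_t>(fake[0][j]);
  const std::size_t s = static_cast<std::size_t>(fake[1][j]);
  return datasets.at(d).at(s);
}
```

In real use the same two rows would be stored in an Armadillo matrix and handed to the learning method, which never looks at the strings themselves.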

Unfortunately, not every machine learning method will work with this kernel. Only machine learning methods which do not ever operate on the explicit representation of points can use this kernel. So, for instance, one cannot build a kd-tree on strings, because the BinarySpaceTree<> class will split the data according to the fake data matrix – resulting in a meaningless tree. This kernel was originally written for the FastMKS method; so, at the very least, it will work with that.

Definition at line 65 of file pspectrum_string_kernel.hpp.

Constructor & Destructor Documentation

◆ PSpectrumStringKernel()

PSpectrumStringKernel ( const std::vector< std::vector< std::string > > &  datasets,
const size_t  p 
)

Initialize the PSpectrumStringKernel with the given string datasets.

For more information on this, see the general class documentation.

Parameters
datasets  Sets of string data.
p  The length of substrings to search.

Member Function Documentation

◆ Counts() [1/2]

const std::vector<std::vector<std::map<std::string, int> > >& Counts ( ) const
inline

Access the lists of substrings.

Definition at line 93 of file pspectrum_string_kernel.hpp.

◆ Counts() [2/2]

std::vector<std::vector<std::map<std::string, int> > >& Counts ( )
inline

Modify the lists of substrings.

Definition at line 96 of file pspectrum_string_kernel.hpp.
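The return type of Counts() suggests one map of substring counts per string, indexed first by dataset and then by string. A minimal sketch of how such a table could be filled, assuming plain substring counting and using the hypothetical name BuildCounts (the actual constructor logic is not shown in this reference and may differ):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using CountTable = std::vector<std::vector<std::map<std::string, int>>>;

// Hypothetical helper (not mlpack API): counts[d][s] maps each substring of
// length p in datasets[d][s] to its number of occurrences.
CountTable BuildCounts(const std::vector<std::vector<std::string>>& datasets,
                       const std::size_t p)
{
  CountTable counts(datasets.size());
  for (std::size_t d = 0; d < datasets.size(); ++d)
  {
    counts[d].resize(datasets[d].size());
    for (std::size_t s = 0; s < datasets[d].size(); ++s)
    {
      const std::string& str = datasets[d][s];
      // Strings shorter than p (or p == 0) contribute an empty map.
      for (std::size_t i = 0; p > 0 && i + p <= str.size(); ++i)
        ++counts[d][s][str.substr(i, p)];
    }
  }
  return counts;
}
```

Precomputing the table once means each later kernel evaluation only compares two maps instead of rescanning the strings.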

◆ Evaluate()

double Evaluate ( const VecType &  a,
const VecType &  b 
) const

Evaluate the kernel for the string indices given.

As mentioned in the class documentation, a and b should be 2-element vectors, where the first element contains the index of the dataset and the second element contains the index of the string. Therefore, if [2 3] is passed for a, the string used will be datasets[2][3] (datasets is of type std::vector<std::vector<std::string> >&).

Parameters
a  Dataset index and string index of the first string.
b  Dataset index and string index of the second string.
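The index convention described for Evaluate() can be sketched as below. EvaluateByIndex is a hypothetical stand-in, not the mlpack member function; it assumes a precomputed table with the same type that Counts() returns and takes any vector type that supports operator[].

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using CountTable = std::vector<std::vector<std::map<std::string, int>>>;

// Hypothetical stand-in for Evaluate(): a and b are 2-element index vectors,
// where element 0 is the dataset index and element 1 is the string index.
template<typename VecType>
double EvaluateByIndex(const CountTable& counts,
                       const VecType& a,
                       const VecType& b)
{
  const auto& ca = counts.at(static_cast<std::size_t>(a[0]))
                         .at(static_cast<std::size_t>(a[1]));
  const auto& cb = counts.at(static_cast<std::size_t>(b[0]))
                         .at(static_cast<std::size_t>(b[1]));

  // Sum the products of counts over substrings present in both strings.
  double result = 0.0;
  for (const auto& entry : ca)
  {
    const auto match = cb.find(entry.first);
    if (match != cb.end())
      result += static_cast<double>(entry.second) * match->second;
  }
  return result;
}
```

Passing [2 3] for a would therefore select counts[2][3], the entry that corresponds to datasets[2][3].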

◆ P() [1/2]

size_t P ( ) const
inline

Access the value of p.

Definition at line 100 of file pspectrum_string_kernel.hpp.

◆ P() [2/2]

size_t& P ( )
inline

Modify the value of p.

Definition at line 102 of file pspectrum_string_kernel.hpp.


The documentation for this class was generated from the following file: