Index of /nicolas/download/pytst

Icon  Name                                    Last modified      Size  Description
[DIR] Parent Directory - [TXT] README.html 10-Dec-2007 22:28 13K [   ] msvcp71.dll 10-Dec-2007 22:28 488K [   ] msvcr71.dll 10-Dec-2007 22:28 340K [   ] pytst-1.12-boost.python-win32-py2.4.zip 10-Dec-2007 22:28 137K [   ] pytst-1.12.win32-py2.3.exe 10-Dec-2007 22:28 223K [   ] pytst-1.12.win32-py2.4.exe 10-Dec-2007 22:28 110K [   ] pytst-1.12.zip 10-Dec-2007 22:28 80K [   ] pytst-1.14.win32-py2.3.exe 10-Dec-2007 22:28 113K [   ] pytst-1.14.win32-py2.4.exe 10-Dec-2007 22:28 114K [   ] pytst-1.14.win32-py2.5.exe 10-Dec-2007 22:28 115K [   ] pytst-1.14.zip 10-Dec-2007 22:28 84K [   ] pytst-1.15.win32-py2.3.exe 02-Apr-2008 01:17 226K [   ] pytst-1.15.win32-py2.4.exe 02-Apr-2008 01:17 116K [   ] pytst-1.15.win32-py2.5.exe 02-Apr-2008 01:19 116K [   ] pytst-1.15.zip 02-Apr-2008 01:17 86K
pytst 1.00 README

pytst 1.00 README

© Nicolas Lehuen 2004-2005
This work is released under the LGPL license - see the LICENSE file for more information.

Note

As of 2005/12/30, this documentation is a bit outdated. I'll try to refresh it now that the API is stabilized...

Introduction

As promised, I've released a very preliminary version of pytst, a Ternary Search Tree (TST or trie) implementation in C++ with a Python interface (built with SWIG). Download it here if you dare !

Basically, it behaves like a dictionary, but the keys can only be plain strings (sorry, not Unicode strings yet). So why bother ? Because TSTs are a lot smarter than dictionaries when it comes to :

I have been using this package in production for nearly a year now, without any problem (except a surprise due to SWIG directors when I switched to Python 2.4, but this is fixed now). I release this under the LGPL. Incidentally, this is my first release under a license, up until now I was pretty much releasing my dirty works in the public domain where it could safely be ignored :). I am NOT satisfied with the packaging (especially the tests), the code layout (it's been so long I had not been writing C++ that I forgot a whole bunch of coding conventions) and so on, but I decided to release this as is, and see what happens.

Requirements and installation

First of all, get the files in the download directory. If you're lucky, there is a binary installer right for you environment (I usually provide builds for Win32). If not, you're going to build the module from the sources.

pytst is a standard Python module, with a classic setup.py install installation procedure. However, the native part of the module is written in C++, so your mileage may vary. I have successfully built the module on those environments :
If you manage to build pytst on any other environment, send me a mail so that I can add it to this list ! Even better, we could add the binaries to the downloads directory.

Running the tests

Well, for now, the test are really crappy. I only have one script using unittest, the other ones don't say anything  about the test passing or failing. So for now, forget about running the test and go straight to the...

Ternary Search Tree crash course

OK, so first, the basics :
>>> import tst
>>> t = tst.TST()
>>> t['foo']='bar'
>>> print t['foo']
bar
>>> print t['fo']
None
>>> t['foo']='baz'
>>> print t['foo']
baz
>>> print t[1]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "tst.py", line 354, in __getitem__
def __getitem__(*args): return _tst.TST___getitem__(*args)
TypeError: argument number 2: a 'char *' is expected, 'int(1)' is received
A TST instance behaves a bit like a Python dict, but the keys can only be strings, more precisely str instances. Sorry, Unicode is not supported now, mostly because I haven't found a way yet to have SWIG handle Unicode. If  you want to use Unicode strings as keys, you have to consistently encode them before storing them, preferably in UTF8 since it has the nice property of not breaking the measurement of Levenshtein distances :
>>> import tst
>>> t = tst.TST()
>>> t[u'café'.encode('UTF8')]='java'
>>> print t[u'café'.encode('UTF8')]
java
>>> print t['caf\xc3\xa9']
java
>>> print t[u'café']
(the process dies unexpectedly)
Talk about a crash course (wink wink, nudge nudge). Oh, yes, it's dirty, but the nice thing about the process being killed is that it saves you from the mess that would result of mixing byte strings and Unicode strings in the same TST.

So, what have we learned so far ? pytst is a kind of dictionary, which handles string keys only, and crash in a not so nice way when given Unicode strings. However, if you read the introduction carefuly, you know that there is more to ternary search trees than this.

Using a TST to tokenize a string

The first usage you can make of a ternary search tree is to use it to tokenize a string. Why would you use a TST when you could use one of the thousand lexers you can find on the Internet ? Well, try building a ruleset for your favorite lexer with a thousand different token definitions. Now try this with ten thousands, a hundred thousands, a million token definition. The lexer will explode, whereas a TST will not.

Clearly, the features of the TST scanning algorithm are much less flexible (don't try to compare them to what you can do with regular expressions, for example), but it scales a lot well. Why would you want to handle a million different token definition ? Well, that's up to you, but DNA sequences parsing comes to mind.

How does the scanning algorithm scales ? Well, like the Aho-Corasick algorithm. The algorithm reads each character from the input string only once, then does its little dance within the TST data structures. Believe me, this is fast, as described here.

Now let's see a little bit of code. You initialize the TST with all the tokens you are looking for, then launch the scan :
>>> import tst
>>> t = tst.TST()
>>> t['1234']='token 1'
>>> t['123456']='token 2'
>>> t['45678']='token 3'
>>> t['5678910']='token 4'
>>> result = t.scan('1234561234567891012345',tst.TupleListAction())
>>> print result
[('123456', 6, 'token 2'), ('123456', 6, 'token 2'), ('78910', -5, None), ('1234', 4, 'token 1'), ('5', -1, None)]
The tokenizing algorithm is greedy : it will try to produce the longest  tokens available, consuming the characters as they come. That's why the token 4 was not produced in the above example : the sequence '123456' was already consumed. This may be a big limitation for some usages ; it means that the patterns you are looking for can only be recognized from their prefix, not from their suffix. ( Maybe that's not how ribosoms parse the DNA (more exactly, the mRNA), so this limitation may render the algorithm useless for DNA sequence analysis, after all. Anyway... )

The TupleListAction callback class

What about tst.TupleListAction() ? This is our first example of giving a callback to the TST so that it can perform an action each time a match is found. TupleListAction just accumulate all the matches in a list, each match being stored as a tuple. The first item of the tuple is either a key from the TST, or a substring from the scanned string. If it is a key, this means a match has been found, so the second item is the key length and the third item is the object associated to the key in the TST. If the first item is a substring from the scanned string, the second item is the substring length with a negative sign, and the third item is None. Typically, the result object can be used like this :

>>> import sys
>>> for matched_string, matched_length, matched_object in result:
... if matched_length>0:
... sys.stdout.write(matched_object)
... else:
... sys.stdout.write(print matched_string)
... else:
... print
token 2token 278910token 15

Here you are, you have built a transcriber that could handle millions of transcribing rules very efficiently.  Well, that's not totally true, since you first have to parse the string to obtain a list of tokens, then iterate on the list of tokens to write something. Why not produce the output while we're scanning the input string ? That's why the tst module has other action callbacks.

The CallableAction callback class

>>> def mycallback(key,length,obj):
...     if length>0:
...     sys.stdout.write(obj)
...     else:
...     sys.stdout.write(key)
...
>>> def myresult():
...     print
...
>>> t.scan('1234561234567891012345',tst.CallableAction(mycallback,myresult))
token 2token 278910token 15
The CallableAction callback class allows you to pass two callables : the first one is called for each match or non-match, with exactly the same arguments as the one that were found in the tuples built by TupleListAction. The second one is called at the end of the scan, without any arguments. Its return value is returned by the scan method, which allows you to write shortcuts like :
>>> class Counter(object):
... def __init__(self):
... self.counter = 0
... def hit(self,key,length,obj):
...     if length>0:
...     self.counter += 1
... def result(self):
... return 'I found %i hits'%self.counter
...
>>> c = Counter()
>>> print t.scan('1234561234567891012345',tst.CallableAction(c.hit,c.result))
I found 3 hits
You can even subclass tcc.CallableAction for more compacity. For example, you could reimplement TupleListAction like this :
>>> class MyTupleListAction(tst.CallableAction):
... def __init__(self):
... tst.CallableAction.__init__(self,self.callback,self.result)
... self.list = []
... def callback(self,key,length,obj):
... self.list.append((key,length,obj))
... def result(self):
... return self.list
...
>>> print t.scan('1234561234567891012345',MyTupleListAction())
[('123456', 6, 'token 2'), ('123456', 6, 'token 2'), ('78910', -5, None), ('1234', 4, 'token 1'), ('5', -1, None)]
Of course this version is less efficient than the original tst.TupleListAction which saves one or two layer of wrappers by going directly from the TST C++ API to the Python C API.

The DictAction and ListAction callback classes

TBD...