Name Last modified Size Description
Parent Directory -
README.html 10-Dec-2007 22:28 13K
msvcp71.dll 10-Dec-2007 22:28 488K
msvcr71.dll 10-Dec-2007 22:28 340K
pytst-1.12-boost.python-win32-py2.4.zip 10-Dec-2007 22:28 137K
pytst-1.12.win32-py2.3.exe 10-Dec-2007 22:28 223K
pytst-1.12.win32-py2.4.exe 10-Dec-2007 22:28 110K
pytst-1.12.zip 10-Dec-2007 22:28 80K
pytst-1.14.win32-py2.3.exe 10-Dec-2007 22:28 113K
pytst-1.14.win32-py2.4.exe 10-Dec-2007 22:28 114K
pytst-1.14.win32-py2.5.exe 10-Dec-2007 22:28 115K
pytst-1.14.zip 10-Dec-2007 22:28 84K
pytst-1.15.win32-py2.3.exe 02-Apr-2008 01:17 226K
pytst-1.15.win32-py2.4.exe 02-Apr-2008 01:17 116K
pytst-1.15.win32-py2.5.exe 02-Apr-2008 01:19 116K
pytst-1.15.zip 02-Apr-2008 01:17 86K
© Nicolas Lehuen
2004-2005
This work is released under the LGPL
license - see the LICENSE file for
more information.
As of 2005/12/30, this documentation is a bit outdated. I'll try to refresh it now that the API is stabilized...
As promised, I've released a very preliminary version of pytst, a Ternary Search Tree (TST or trie) implementation in C++ with a Python interface (built with SWIG). Download it here if you dare !
Basically, it behaves like a dictionary, but the keys can only be plain strings (sorry, not Unicode strings yet). So why bother ? Because TSTs are a lot smarter than dictionaries when it comes to :
I have been using this package in production for nearly a year now, without any problem (except a surprise due to SWIG directors when I switched to Python 2.4, but this is fixed now). I release this under the LGPL. Incidentally, this is my first release under a license, up until now I was pretty much releasing my dirty works in the public domain where it could safely be ignored :). I am NOT satisfied with the packaging (especially the tests), the code layout (it's been so long I had not been writing C++ that I forgot a whole bunch of coding conventions) and so on, but I decided to release this as is, and see what happens.
>>> import tstA TST instance behaves a bit like a Python dict, but the keys can only be strings, more precisely str instances. Sorry, Unicode is not supported now, mostly because I haven't found a way yet to have SWIG handle Unicode. If you want to use Unicode strings as keys, you have to consistently encode them before storing them, preferably in UTF8 since it has the nice property of not breaking the measurement of Levenshtein distances :
>>> t = tst.TST()
>>> t['foo']='bar'
>>> print t['foo']
bar
>>> print t['fo']
None
>>> t['foo']='baz'
>>> print t['foo']
baz
>>> print t[1]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "tst.py", line 354, in __getitem__
def __getitem__(*args): return _tst.TST___getitem__(*args)
TypeError: argument number 2: a 'char *' is expected, 'int(1)' is received
>>> import tstTalk about a crash course (wink wink, nudge nudge). Oh, yes, it's dirty, but the nice thing about the process being killed is that it saves you from the mess that would result of mixing byte strings and Unicode strings in the same TST.
>>> t = tst.TST()
>>> t[u'café'.encode('UTF8')]='java'
>>> print t[u'café'.encode('UTF8')]
java
>>> print t['caf\xc3\xa9']
java
>>> print t[u'café']
(the process dies unexpectedly)
>>> import tstThe tokenizing algorithm is greedy : it will try to produce the longest tokens available, consuming the characters as they come. That's why the token 4 was not produced in the above example : the sequence '123456' was already consumed. This may be a big limitation for some usages ; it means that the patterns you are looking for can only be recognized from their prefix, not from their suffix. ( Maybe that's not how ribosoms parse the DNA (more exactly, the mRNA), so this limitation may render the algorithm useless for DNA sequence analysis, after all. Anyway... )
>>> t = tst.TST()
>>> t['1234']='token 1'
>>> t['123456']='token 2'
>>> t['45678']='token 3'
>>> t['5678910']='token 4'
>>> result = t.scan('1234561234567891012345',tst.TupleListAction())
>>> print result
[('123456', 6, 'token 2'), ('123456', 6, 'token 2'), ('78910', -5, None), ('1234', 4, 'token 1'), ('5', -1, None)]
Here you are, you have built a transcriber that could handle millions of transcribing rules very efficiently. Well, that's not totally true, since you first have to parse the string to obtain a list of tokens, then iterate on the list of tokens to write something. Why not produce the output while we're scanning the input string ? That's why the tst module has other action callbacks.
>>> import sys
>>> for matched_string, matched_length, matched_object in result:
... if matched_length>0:
... sys.stdout.write(matched_object)
... else:
... sys.stdout.write(print matched_string)
... else:
token 2token 278910token 15
>>> def mycallback(key,length,obj):The CallableAction callback class allows you to pass two callables : the first one is called for each match or non-match, with exactly the same arguments as the one that were found in the tuples built by TupleListAction. The second one is called at the end of the scan, without any arguments. Its return value is returned by the scan method, which allows you to write shortcuts like :
... if length>0:
... sys.stdout.write(obj)
... else:
... sys.stdout.write(key)
...
>>> def myresult():
...
>>> t.scan('1234561234567891012345',tst.CallableAction(mycallback,myresult))
token 2token 278910token 15
>>> class Counter(object):You can even subclass tcc.CallableAction for more compacity. For example, you could reimplement TupleListAction like this :
... def __init__(self):
... self.counter = 0
... def hit(self,key,length,obj):
... if length>0:
... self.counter += 1
... def result(self):
... return 'I found %i hits'%self.counter
...
>>> c = Counter()
>>> print t.scan('1234561234567891012345',tst.CallableAction(c.hit,c.result))
I found 3 hits
>>> class MyTupleListAction(tst.CallableAction):Of course this version is less efficient than the original tst.TupleListAction which saves one or two layer of wrappers by going directly from the TST C++ API to the Python C API.
... def __init__(self):
... tst.CallableAction.__init__(self,self.callback,self.result)
... self.list = []
... def callback(self,key,length,obj):
... self.list.append((key,length,obj))
... def result(self):
... return self.list
...
>>> print t.scan('1234561234567891012345',MyTupleListAction())
[('123456', 6, 'token 2'), ('123456', 6, 'token 2'), ('78910', -5, None), ('1234', 4, 'token 1'), ('5', -1, None)]