Nicolas Lehuen's Weblog

To content | To menu | To search

Wednesday, April 2 2008

pytst 1.15

I've just released pytst 1.15, you can download it there.

This release fixes a quite important problem introduced on 2006/05/05... For a reason I don't quite remember now, I decided to make the TST C++ template specialization dedicated to the Python bindings privately inherit from the more generic tst<...> template. I guess this was related to some problems I had with SWIG.

The problem is that this broke something in the way SWIG handled inheritance between the two types, the net result being that the Python TST class lost its prefix_match methods and other methods that were declared in the parent class but not in the specialization.

I've reverted the inheritance between the two C++ classes back to public inheritance, and everything is back to normal now.

I really wished I could spend more time working on ctst, since it features theoretically better structures and algorithms, plus to be frank I'm quite fed up with C++, and I'd be more than happy to go back to C.

Also, I must say that the native API for Ruby is very impressive. I've been developing the Ruby bindings for ctst by hand, not using SWIG, and I've spent substantially less time than it took to have SWIG properly handle my C++ code from pytst...

Monday, October 9 2006

I wish I learned Japanese

When I was young (I'm talking days of yore, here, about ten years ago) and innocent, I was an engineering student at the INSA (Institut National des Sciences Appliquées, meaning National Institute for Applied Sciences). I was there for 5 years, and two of the best teachers I had were not math or physics teachers, but English teachers. Their courses were very important to me because they not only taught English, but also how to make a clear presentation and manage meetings. I'm using this every day, now. If I compare this to the tons of math I've learned (sometimes at 30 hours a week doses, the rest being computer science), which I'm using at somewhere near 5% now, the yield of those English courses is clearly superior. Mr. and Mrs Souillard, thank you ! I hope my awkward writing won't be a disgrace to your teaching standards.

Anyway, Mr. Souillard also taught optional Japanese lessons, but I've decided not to follow them, because, you know, I had much more interesting things to do, like more maths or physics (and, to be comprehensive, drink booze, play video games and make out with girls).

So, it looks like there is a justice, after all, because my pytst code has been noticed by some Japanese people, and I can't understand what they write about it. Masaki Yatsu contacted me today to announce that he ported pytst to Ruby, apparently using SWIG. He also added support for multibyte character sets, which is obviously pretty important in Japanese. This is the part I've understood, because he wrote me in English. Unfortunately, his blog is in Japanese. Check out the Google Translation result, if you want to have some fun. I can barely understand 25% of what Masaki writes (and to be clear, it's only my fault :).

As for odz's blog entry, it was much worse, because I couldn't get whether he had better or worse results using pytst versus Perl regexp (it turns out my code performs ten times better for his use case, yay!).

I'm not sure one or two years of learning Japanese would have helped me much in this situation, though. I guess in the best case I would have the same level of understanding that Google Translation gives me. But at least I wouldn't feel such a dork, instead I would be bragging about my deep knowledge of kanji.

Anyway, it's good to see one's code reused by people from the other side of the planet, with different alphabets and character encodings, apparently without too many problems. I'm getting a warm and fuzzy feeling from that :).

Incidentally, since I've posted a comment on odz's blog, some spam crawler has fetched my email address, and I now receive spam in japanese. Wonderful. My bayesian anti-spam filter is so happy to get all those new words to learn ! Actually it's not.

Saturday, October 7 2006

pytst 1.14 for Python 2.3, 2.4 and 2.5

It's been a few months since my last release of an official version of pytst. Seeing that someone in Japan uses it made me have a look at the project once again. If I remember correctly, the latest evolutions where mostly refactorings in the hope that I could come up with on-disk storage of the tree, plus a few experiment on my full text indexing system based on the TST (which I won't release yet). So, I'm releasing the latest version, having made sure that it compiled correctly for Python 2.3, 2.4 and 2.5 on Win32 and Ubuntu Linux (I only tested Python 2.4 on Linux)

I've been pretty busy lately, what with the fact that I now have TWO regular jobs (both theoretically being at 50%) : I'm still CTO at The CRM Company, but I am now VP Product & Marketing (and developer :) ) at Yoono on top of that. Interesting times...

My work on TSTs has been used at Yoono ; my friend (and now boss) Laurent Quérel and I have gone a bit further with TSTs and came up with a kind of TST which can store multiple characters per tree node. Any time some part of a prefix can be shared by multiple keys, the part is fully stored in the node and not duplicated, thus saving a lot of unnecessary nodes and navigation. Though it surely exists somewhere, we haven't seen it anywhere else, so we decided to simply call this kind of TST a CompactTST. What's interesting in this TST is that it has a very low memory overhead : the indexing structure only adds 10% to the bulk volume of the data (string key + integer value in our case) it contains.

As I've wrote previously, my work on Yoono made me switch back to writing Java. Hence, the CompactTST is written in Java, but with a low-level edge : like I've did in C++ for pytst, we didn't want to build a Java instance per tree node. Our solution is a bit rough but it does work perfectly : we store nodes in a byte[], manually managing the layout of nodes and the storage of their data in the byte array. The result is pretty fast, very compact and the GC doesn't work at all.

This gave me the idea of reusing this method to store a big number of objects in memory. Now that we have 64 bit servers with 8 Gb memory, why not make the most of it ? So currently I'm on my way to storing tens of millions of objects in memory through my own memory allocator system, implemented on top of multiple byte arrays, without any problem caused by memory or GC overhead. Of course, this means that I can't rely on the GC anymore, but it's the whole point anyway. We now have a way to store millions of String, int, List<Integer> and List<String> in an efficient way. I'm sure die hard Java developers would cringe at what we are doing, but it's fun and worthwhile.

I'm still planning on writing pytst 2.0, this time with a proper storage layer which will make me able to switch from memory-based to disk-based storage. Now that we have a neat CompactTST in Java, I'll take the opportunity of porting the algorithms to C++ at the same time. It's only a matter of finding the time to do it, which is easier said than done...

Monday, March 20 2006

pytst 1.09

This release, available here as usual, adds the match method, which supports the glob-like wildcards ? and *. For instance, my_tree.match("a*?d",None,DictAction()) will give you matches for "added" and "abcd" and even "and" but not for "ad".

Sunday, March 5 2006

pytst 1.08 : SWIG is back in the game

This release uses SWIG 1.3.28. The latest version of SWIG brings huge improvements in performance, bringing it on par with Boost.Python according to my crude benchmarks. The advantage of SWIG is that it is much more easy to build, being supported by the distutils package (though I had problems with C++ support). You don't have to learn a new set of building tricks like when using Boost. So, it seems that SWIG is back in the game !

Big thanks to the SWIG development team !

Update : Using the -O optimizing option in SWIG 1.3.28 yields even better performance, thus making the SWIG wrapper effectvely faster than the Boost.Python one. I don't know what happens under the hood though, but at least I can read the generated code, contrary to what happens in Boost. My unit and torture tests behave as before, so it seems everything is OK ! Great !

Thursday, February 23 2006

pytst 1.07

The pytst 1.07 distribution improves vastly the packing algorithm. Turns out my previous so-called "clever algorithm" was running in O(n2), which is as bad as it can get. Well, no, it could have been O(2n), but still. The new algorithm runs in O(n (3+log n)). I'm thinking about a better one, but I just couldn't leave things as is until I come up with the final algorithm.

Warning : I've just checked the Python 2.3 build, and GCC 3.4.2 (mingw-special) chokes on the most stupid piece of code ever :

#include <exception>

class TSTException : public exception {
public:
    TSTException(const char* _message) : message(_message) {
    }

    const char* what() {
        return message;
    }

private:
    const char* message;
};

GCC complains that exception is not a known class name. Doh ! I have to get some sleep now, I'll see this later.

Saturday, February 18 2006

pytst 1.06

I've made a lot of internal changes and improvements to the C++ part of pytst in order to implement an incremental full text index with both Python and COM bindings.

The only visible change of the pytst 1.0.6 distribution is a more compact serialisation format. It's not top notch compression, but it saves about 25% of disk space.

For top notch compression, I'm currently reading Managing Gigabytes, an excellent book about compression and documents (text and images) indexation. It's very interesting and pretty inspirative. It's rather old, though, for example the index mentions Excite and Altavista but not the small but promising Google search engine. But it's a theory book and most of the theory hasn't changed in 10 years, of course.

Friday, January 6 2006

pytst 1.01

Download it here.

What's new ?

  • There is at last a nice test suite which I use to ensure that there are no regression and that both the SWIG version and the Boost.Python version are in sync. The test suite also records performance data in a CSV file which I can then analyse with Excel to check if there is a performance improvement or regression when I introduce something.
  • A creeping bug in close_match is definitely fixed. I always knew it was here but could not properly reproduce it. Luckily with the new test suite I was able to track it.
  • WARNING : there is a change in the DictAction collector : it now retains the lowest distance whenever a key,value pair is seen multiple times, for instance in close_match. The close_match algorithm now stores distances (the lower the better) rather than remaining distances (the higher the better). For instance, if you were looking for all entries with a maximum distance of 1 from "123", you will now get "122" with a distance of 1 instead of a remaining distance of 0. This is much more intuitive than before.
  • Speaking about collectors, you know, those ListAction, TupleListAction, DictAction classes ? Well, those are not the only way to collect data from the TST. There are now true Python iterators ! Instead of using walk or close_match, you can use the default iterator, the iterator([root]) method or the close_match_iterator(string,max_distance) method.

Examples of iterator usage :

from tst import *

t = tst.TST()
# ... filling t...

# BEFORE :
contents = t.walk(None,DictAction())
# NOW :
content = {}
for key, value in t:
    content[key] = value

# BEFORE :
# All words beginning by "foo" :
from_foo = t.walk(None,DictAction(),"foo")
# NOW :
from_foo = {}
for key, value in r.iterator("foo"):
    from_foo[key]=value

# BEFORE :
# All words close to "foo" :
close_to_foo = t.close_match("foo",1,None,DictAction())
# NOW :
close_to_foo = {}
for key, value in r.close_match_iterator("foo",1):
    if key in close_to_foo:
        # the close match iterator can return the same
        # key multiple times
        print "%s already inserted"%key
    else:
        close_to_foo[key] = value

What's the point if the new code is bigger ? Well, if your tree is big, you may save some memory by iterating into the tree rather than by collecting all the data in a list / list of tuples / dictionary THEN processing it. That's your choice ! Performance-wise, I'd say the collectors should be faster, but I still need to check this.

Friday, December 30 2005

pytst 1.00 !

Here it is !

What's new : after reading Scott Meyers's Effective STL, I finally saw the light and was able to squeeze a little bit more of performance, eventually getting faster using std::vector than what I got with my hand-written vector-like class. This cancels what I've written about std::vector. This shows that STL is a very powerful library, but also that it makes it easy to shoot yourself in the foot if you don't use it correctly. It also shows that Scott Meyers's books are much needed if you want to write, well, effective C++ (wink, wink, nudge, nudge).

As an extra bonus, the new implementation of the pack() method now really packs the tree. Removed nodes were not, and still aren't, removed from the tree. Instead, they are kept for later reuse, so that if a new node is later needed, a removed node can be used instead. This saves a lot of time and memory if you have a dirty phase in your computations where you have to add and remove a lot of nodes. Once the war is over, however, those empty nodes were left in memory, even if the chance of them being reused dropped to zero. Now, when you pack() the tree, those nodes are removed (the tree nodes get defragmented) and the final memory buffer gets shrinked down to the minimum.

It's somewhat comparable to what you do when you "compress" a database (vacuum it in PostgreSQL, compress it in Access or Microsoft SQL Server, optimize it in MySQL, I guess). It has been quite interesting to find this algorithm as I guess it'll be pretty useful when disk storage will come into play.

I'm leaving for Barcelona tomorrow night to spend New Year's Eve with some friends over there. So I won't update this blog till then... So let me give you readers (all 3 of them) my best wishes for 2006 !

Tuesday, December 27 2005

pytst 0.99

The new release is available from the usual place. What's new ? Well, above the surface, there is a new contains(string) method which tells you whether the tree contains a given string or not. Another new thing is a Win32 binary release of pytst using the Boost.Python library, which yields 25% more performance. If you're interested by building a binary version of this for another platform, write me. I've been told that the latest (CVS) SWIG version was much more optimised, but alas, there is no Win32 binary version as of today and I'm too busy to try to build one.

Under the hood, I've been doing a bit of source code refactoring, in preparation of a new on-disk storage implementation. To start with, I plan on using Berkekey DB in Recno mode, since it's quite close to an on-disk vector. If everything goes well, I'll try to do it by myself (of course I don't think I'll achieve the same level of features and quality than the one provided by Sleepycat).

Sadly, it looks like my design for the node storage manager interface is a bit clunky... To do it well, I'll have to refactor so many things that I'm beginning to think about a full rewrite (yeah, I know, it's bad). In this case, I'll release the current version as 1.0 and I'll start over on a 2.0 branch, fully API compatible but with the new on-disk storage feature.

Note that there already is some support for disk storage of a tree ; but it's a cold storage, you populate the tree in memory, then save it on disk, then you can reload it from disk to memory later on. The problem is that the tree is entirely loaded into memory, which limits the usage to huge trees (because you can hold one heck of a huge tree into 1 Gb of RAM), not letting us dwelve into the realm of mega huge trees :). A real on-disk storage would allow me to page in and out parts of the tree from the disk to the RAM.

Another note : I'm currently using pytst to experiment with a kind of full text index that support fast incremental search. So far I've got a working prototype implemented as a layer of Python code relying on pytst. My experiments with a set of files from the Gutenberg project are very satisfying. I can't wait to have an on-disk storage to be able to compete with Google :).

Oddly, when I tried to port my prototype to full C++ (with a Boost.Python interface), I've ended up with a slower index... Even if Python's dict is pretty fast, there must be something I don't understand about the C++ STL, so I bought Scott Meyers' Effective STL to help me there.

Wednesday, December 7 2005

Boost.Python rocks !

After a few hours of struggle, I managed to build a Python binding of tst 0.97 using Boost.Python. The results are great, my (unscientific) performance benchmarks show a 25% performance improvement over the version using SWIG. This is not surprising since SWIG takes your C++ class, builds a flat-view of your class as a set of C functions, then build a Python wrapper around this set of functions.

A few remarks, though :

  • Boost and Boost.Python is a whole new world to discover. If you want things to remain a bit under your control, you HAVE to forget about make, nmake or Visual Studio and learn a new project building tool, bjam. It does not go without its surprises (for example, as a hint, don't name the top-level directory of your project "boost". It will confuse bjam.). Also, I'm a bit worried about the "buildability" of pytst. Up until now, it could be built using python setup.py install, now you have to use a whole new toolset.
  • What's cool is that the .pyd extension does not need a .py module. What's less cool is that it need a runtime DLL named boost_python.dll (but it is built along with your own extension).
  • Boost.Python is quite clever, but it does not understand the char* string, int string_length idiom. SWIG does not, either, but at least you can teach it using SWIG templates. To work around this I had to use std::basic_string which is recognized by Boost and mapped to a Python str type. Anyway, I have to change tst so that it uses basic_strings instead of this idiom which is hackish and not supported by Delphi, either. Because, you know, the Yoono client is developed in Delphi (hint, hint).
  • Boost.Python is very clean and cool, using hardcore C++ meta-programming constructs. The problem is that you don't get to know what is the result of all this magic. There is no code generation, since everything is done through meta-programming. This is a bit disturbing because you have to have an absolute trust in Boost.Python and hope it always does The Right Thing™.
  • Since it uses hardcore C++ meta-programming, I strongly advise you to refresh your knowledge on C++ templates. On the one hand, the mapping code is pretty simple and straightforward (see below), but on the other hand, as soon as you dwelve a bit further in the tricks and trades of overloading pure virtual functions in Python, things get quite hairy (see below, too).

To sum up : what's cool with Boost.Python ? Performance, and code like that :

   class_< TST >("TST")
        .def("put",&TST::put)
        .def("__setitem__",&TST::put)
        
        .def("get",&TST::get)
        .def("__getitem__",&TST::get)
        
        .def("get_or_build",&TST::get_or_build)
        
        .def("remove",&TST::remove)

        .def("write",&TST::write)
        .def("read",&TST::read)

        .def("walk",&TST::walk)

        .def("close_match",&TST::close_match)
        .def("prefix_match",&TST::prefix_match)

        .def("pack",&TST::pack)
    ;

What's less cool is code like that :

template <class S, class T> class string_action_wrapper :
        public action<S,T>,
        public wrapper< action<S,T> > {
    public:
        void perform(S* string,int string_length,int remaining_distance,T data) {
            return call<void,std::basic_string<S>,int,T>(
                this->get_override("perform").ptr(),
                std::basic_string<S>(string,string_length),
                remaining_distance,
                data
            );
        }
        
        T result() {
            return call<T>(this->get_override("result").ptr());
        }
};

But who cares as long as we get some performance ! Mu-HAHAHAHA ! Err, sorry...

Thursday, October 20 2005

pytst 0.97

This release fixes a bug in the latest walk(filter,action,start=None) version : when start was provided, the results were sometimes wrong. I should never release things in a hurry :). Download it here.

Tuesday, October 18 2005

pytst 0.96 : warning, API changes

OK, I should have done this waaaaaaay before. Anyway, I've made those changes to the API :

  • TST.almost(string,maximum_distance,filter,action) has been renamed to TST.close_match(string,maximum_distance,filter,action)
  • TST.common_prefix(string,filter,action) has been renamed to TST.prefix_match(string,filter,action)
  • walk now takes an optional parameter : TST.walk(filter,action,start=None). If start is provided, the walk starts there, giving you all entries in the tree whose key begins with the given string.
  • The NullFilter class is removed, since it is useless. Just pass None if you don't want any filter.

I'm sorry if the migration poses any problems, all the more so that I'm the first who has to refactor some code :) . The method names were very bad, common_prefix was especially confusing. This should be the last API change, the 1.0 version is close now.

Download pytst 0.96 here.

Monday, October 17 2005

pytst 0.95

William Underwood found a bug in the common_prefix method. Bad results where returned... This method was largely unused by me and therefore untested, which proves that you should never publish untested parts of an API, because someone somewhere will use it even if you don't yourself :).

pytst 0.95 is available in the best groceries and also on download here. You'll also find a 0.95.noscan which does not have the Aho-Corasick algorithm but has smaller TST nodes (20 bytes vs 32 bytes, the extra 12 bytes are used to represent the backtracking data). This could interest you if your application is memory intensive.

Thursday, September 22 2005

pytst 0.94

Nothing much new this time, I'm now using setuptools to package the application into an egg file (see EasyInstall for the user's point of view). I have also sneakily used my access on one of the Apache foundation's FreeBSD servers to check whether pytst can be built and used on another platform than Win32. Everything seems OK :).

Download pytst 0.94 here.

Monday, September 19 2005

pytst : ouch !

I had no time to cover this previously (what with my new fatherhood and all), but I recently have been contacted on the 1st of september by Stani of SPE fame. He was interested in using pytst to implement a mass search/replace function. He had tried the regex way but to no avail. I gave him a bit of source code to do this with pytst, and he came back to me with his own version :

DELIMITERS = ' .,?!@():"\'
'
reDELIMITERS = re.compile('([%s])'%DELIMITERS)

class MultiReplaceTST(tst.TST):
    def __call__(self,input_string):
        output = cStringIO.StringIO()
        for source_string, status, replace_with in self.scan_with_stop_chars(
            input_string,DELIMITERS,tst.TupleListAction()):
            if status>0:
                output.write(replace_with)
            else:
                output.write(source_string)
        return output.getvalue()

class MultiReplaceDict(dict):
    def __call__(self,input_string):
        input_string = reDELIMITERS.split(input_string)
        output = cStringIO.StringIO()
        for word in input_string:
            try:
                output.write(self[word])
            except KeyError:
                output.write(word)
        return output.getvalue()

Well, it's hard to admit, but the second version, using Python dict and re, is apparently ten times faster than the first version using pytst... Ouch !

I'll have to test it myself, but I think one of the culprit (besides me :) is SWIG. As I wrote in a previous post, profiling pytst showed that the hotspots were mainly all in the C/C++ wrapping code generated by SWIG. This is very irritating, so I now feel the urge of reimplementing pytst without using SWIG.

Tuesday, July 26 2005

std::vector is very slow ?

I was trying to understand why my latest version of pytst (0.92) was 10% slower than the previous one (0.86). I have made three important changes between those two versions :

  • I used my PythonReference class to ease the Python reference count management. I was fed up with having to manage Py_INCREF and Py_DECREF manually, so I decided to let the C++ compiler do the job itself, following the advice of Scott Meyers.
  • I made a complete overhaul of the memory_storage class, which is the class that, well, manages the storage of the TST nodes in memory (my long term plan is to provide a file_storage class to be able to store and use the tree directly on the filesystem). I was feeling a bit silly to have implemented my own kind of std::vector implementation, so I wisely decided to use std::vector instead. After all, why reinvent the wheel ?
  • The serialization system is now completely independent of the storage implementation. It's like a generic dump of the tree. This way, it will be possible to serialize a tst instance backed by a memory_storage and load it into a tst instance backed by a (not yet available) file_storage. Of course, this point is not related to the performance loss at all.

It turns out that when you use a profiler (I used the 15-days free evaluation of AQtime 4, which has everything I could dream of for a profiler, the first thing being that "it just works"), a lot of methods from std::vector are hotspots. Granted, this is in debug mode, so some code that should be inlined is not, but if a function is an hotspot when called, it has good chances of remaining an hotspot when inlined, don't you think ? Well, I've quickly reimplemented a std::vector-like behaviour right into memory_storage, and presto, I nearly get back my 10% percents of lost performance. Plus, the source code is not so much more complicated.

So, sometimes it's worth reinventing the wheel, especially for simple cases like that.

Anyway, those results are not especially surprising : the Microsoft implementation of std::vector::at (which was the most method that my code used the most) instantiates an iterator and uses the overloaded + operator, only to access a vector element ! I can't imagine how this could be more complicated. I don't know if this kind of convoluted way to do things is also found in other STL implementations, but it is a bit creepy...

Now, after the removal of std::vector, what's the top hotspot ? SWIG_Python_ConvertPtr ! Looks like SWIG will have to disappear... I'm really thinking about using PyCXX or Boost.Python.

It is very frustrating to have to fight against all those layers of foreign code, when all you want to do is optimize your own code ! Anyway, here comes pytst 0.93 !

Tuesday, June 7 2005

Working on an Hash Array-Mapped Tree (HAMT)

I'm currently working on an implementation of an Hash Array-Mapped Tree (HAMT), a replacement for hashtables which is theoretically more efficient. Right now I have the C++ implementation, I still have to work on memory and CPU optimisations as well as Python and Java bindings. I'd like to try something else than SWIG, this time, because I'm not totally satisfied with my experience on pytst. Maybe Boost.Python ?

Yet another Python TST implementation

Here, from a French compatriot ;-). This one implements tail compression, which is an important space optimisation for TSTs. I have not implemented it until now since some of the algorithms implemented in my TST (namely the close match algorithm) would become quite complicated. I'm still thinking about a way to do it efficiently...

Update: this is a TST implementation all right, but without Python bindings.

Wednesday, May 25 2005

pytst 0.85

The reference counting problem should be fixed in this version. I finally understood the various interactions between Python, SWIG and my code. Download it here.

To stop worrying about reference counting, I'm planning on implementing a kind of smart pointer, this way :

class PythonReference {
public:
    inline PythonReference() {
        this->ref=NULL;
    }

    inline PythonReference(PyObject *object) {
        this->ref = object;
        Py_XINCREF(object);
    }

    inline PythonReference(const PythonReference& that) {
        this->ref = that.ref;
        Py_XINCREF(this->ref);
    }

    inline PythonReference& operator= (const PythonReference& that) {
        PyObject* old = this->ref;
        
        if((this->ref = that.ref) != old) {
            Py_XINCREF(this->ref);
            Py_XDECREF(old);
        }
        
        return *this;
    }

    inline PyObject* object() {
        return ref;
    }

    inline ~PythonReference() {
        Py_XDECREF(ref);
    }

private:
    PyObject* ref;
};

That is to say a class encapsulating a Python object reference and automagically incrementing and decrementing the reference count as it is copied and passed in function calls.

- page 1 of 2