Nicolas Lehuen's Weblog

To content | To menu | To search

Friday, May 29 2009

FriendFeed might actually benefit from Google Wave

With the current buzz about Google Wave, it is difficult not to feel sorry for FriendFeed. The common ground between Wave and FriendFeed is indeed pretty rich : real time communication, threaded conversations, lightweight social networking... Some people even suggest that there is some irony in seeing Google rip off good ideas from FriendFeed, the latter being founded by ex-googlers.

However, there is something peculiar about Wave : apart from the very Googly UI, the infrastructure is meant to be opened and decentralized. Wave can be seen as a way to collaboratively edit an XML document (the wavelets, with are the basic communication nodes found in waves). The point is that this XML document can be replicated on many providers' infrastructures, and updates are propagated through extensions of the XMPP protocol (XMPP is AKA Jabber, it's an instant messaging protocol that Google Talk also uses with extensions).

Check out this paper about Google Wave Federation Infrastructure.

So, we are not really in a winnner-takes-all scenario. Now, if I where FriendFeed, I'd have two choices, now.

1) Decide this is a Fire and Motion move from Google, aimed at first at Microsoft (to take the wind out of Bing), then Facebook, then FriendFeed. Keep away from this game, and slowly become an awesome app no one uses, dedicated to geeks and power users, and watch people have all the fun out there on Wave or Facebook.

2) Embrace Google Wave Federation Protocol and build gateways and proxies that enable FriendFeed users to participate into Google Wave discussion without leaving their favorite UI. Of course this means that FriendFeed might become a victim of the Fire and Motion scheme : Google adds new stuff into Google Wave, FriendFeed runs to support it, then Google adds new stuff, and on and on, which means that FF could loose its technical leadership on the market. However, there is certainly room for more than one Wave provider, and if this new way of communication becomes mainstream, this could mean that FriendFeed would at last find a way out of Geekland and reach a wide audience, wider than it is currently now. Then launch a feature war with Google on the social networking or real-time search front, both fronts on which they have already proven that they good do very well.

So, depending on the choice they make now, I think Google Wave could actually be good news for FriendFeed, turning their awesome power-app into the second player in a potentially huge new market.

Sunday, April 19 2009

The difference between pytst and ctst

Putting away the difference in languages (C++ for pytst, pure C for ctst), the two projects are quite different with regards to the way data is stored in the tree.

pytst implements a text-book ternary search tree with an AVL balancing algorithm. While I was working on this project, I designed a scanning algorithm that I particularly proud of, until I found out that it was already known as the Aho-Corasick algorithm. This scanning algorithm is handy when you want to match an input text against a huge quantity of search strings (which was the initial use case that led me to develop this).

The problem with pytst is that it is a very straightforward ternary search tree implementation, in that it requires at least a tree node per character per unique prefix in the input strings. Given all the meta data that can reside in a tree node (at minimum, we need the character plus up to three pointers to other nodes, plus one pointer for Aho-Corasick, plus anything required by the memory allocator), there is a huge memory overhead. The algorithms pytst implements may be pretty efficient, but the trade off in memory is quite taxing.

While I was working for Yoono, we decided to store huge quantities of URLs in memory using a TST, since the hierarchical nature of URLs from the same domain was a good fit for this structure. This time, the implementation language was Java, but we developed some pretty nifty tricks to optimize memory usage ; basically I reimplemented malloc/free running on int[], char[] etc. This way we would reduce the load on the Java memory allocation, object header overhead, GC, etc. We could then store millions of node in memory. This was cool, and it worked well, but it was a bit stupid from me the optimize the allocation issues before optimizing what was allocated. Fortunately, a clever colleague from Yoono (Yann Landrin-Schweitzer) took over my work and went further, optimizing the data structure itself based on some discussion he, Laurent Quérel and I had together.

ctst started as a simple rewrite in pure C of pytst, motivated by the difficulties I encountered with compiling the C++ code of pytst with different compilers, and building bindings for Ruby (which I had started using at that time). I also wanted to separate as much as possible the code implementing the algorithms from the code implementing the code storage, which was something we managed to do quite cleanly in Java. The ideas about optimizing the structure came by during the rewrite, so I finally decided to jump ship and implement them in ctst as well.

So how does ctst store data in the tree ? We took some of the ideas from Patricia trees and mixed it with the B-Tree principles. The result is a pretty compact and efficient structure. To illustrate it, let's use a little example.

Suppose we want to store this list of words in ctst :

compacity
compact
compacted
community
commuter
commuters
compute

ctst build this tree (click on the image to zoom) :

Compacity test

( This drawing was done automagically using the dump method, which generate a Graphviz .dot file. )

This gives 11 nodes for 7 strings. I won't waste time to draw the ternary search tree equivalent, so you'll have to trust me, but you would need at least 25 nodes. Of course the node don't contain the same things, but the payload/overhead ratio is highly in favor of ctst.

Of course I haven't re-implemented all pytst algorithms in ctst, but now that I "finally" managed to correct a long-standing but in the codebase, I'll be able to get back at it. So stay tuned !

Thursday, April 16 2009

Mapping my personal web

My personal web

Everything starts with a slice of the Web, in the guise of a bunch of RSS feeds. Currently my blog roll hover a little above 280 feeds. There are the broad blockbusters like Techcrunch, ReadWriteWeb, Techmeme, Presse-Citron (a French blog), the mainstream media (e.g. French newspapers), but there are also a lot of small, specialized, "vertical" blogs from developers, entrepreneurs, competitors and/or friends.

Of course I cannot cope with the quantity of content that those ~280 feeds fetch. Some people are very anxious about reading everything in their blog roll, which for instance led to the introduction of a new anti-feature in Google Reader : "hide unread count". I was a bit like that back in pre-history (circa 2004) but nowadays I just tell myself that I won't be able to read everything, c'est la vie.

One thing is sure, I won't miss anything important thanks to the "echo chamber" effect. As soon as big news hit the interpipes, I see it multiplied a few dozens of times on many feeds, and even if I missed it during the first 24 hours, I get a second chance with the French mainstream media which are often 48 hours late.

So each day I get the big news fast and I can quickly skip over the echoes. Then I get a bunch of small news, "vertical" articles, things that border between work, geeky things, technical articles, friends updates and so on.

My main tool for reading all this is Google Reader. More precisely, I mainly use the iPhone version of Google Reader during my daily subway trips, which gives me around 60 to 90 minutes a day (morning + evening) to do some triage and select interesting items to review later at my home or office (I star those items in Google Reader), or items to share right away.

The iPhone version of Google Reader is especially well designed for triage : the app has a mode in which it only displays 15 unread items at a time, and there is a "mark as read" link which marks all those items as read and loads 15 new items right away. So that's what I do in the subway : just skip items whose title doesn't seem interesting, read interesting items, star those I want to come back later on a bigger screen or share them right away. Believe me, since I've began doing this, I've never been bored while in the subway or while taking a cab :). Google Reader tells me that in the last 30 days, I've read 5.384 items which gives nearly 12 screens of 15 items a day.

BTW, I guess the batch loading mode of the iPhone app was designed to address connection problems, and it's especially effective in the subway. For example, on my daily subway trip, there is a weak spot in my operator's coverage for one Metro station, so I take care of loading a batch just before entering the station. This way, I don't need the connection while in the weak spot. I sometimes get frustrated when I go on Metro lines that I don't know as well, though :)

The items that I share are reviewed by friends directly on Google Reader, but they can also be found on a public RSS feed that is injected into FriendFeed for comments there, too. The RSS feed is also injected in one of our corporate blog powered by WordPress (+ FeedWordPress) so that my colleagues and I can discuss about it far from the prying eyes of our competitors :). Things could change now that Google Reader has a true commenting system with some privacy features, but having a private blog hosted by ourselves is much more reassuring for now.

For fun, I've got a few more services aggregated into FriendFeed : my Amazon wish list (feel free :) ), my public activity feed from github (I've got two projects there, though I'm not very active), LinkedIn feeds, etc.

Finally, FriendFeed posts everything to Twitter (except what's coming from Twitter, of course, the echo chamber metaphor should remain one), so that I don't have to do it myself :). To be frank, the reason I made FriendFeed push to Twitter is that for now, I don't really use Twitter, but it seems to be quite important nowadays... So this is a cheap way to push updates on Twitter without going all the way into micro-blogging mode.

So, does anybody have some more tricks to share about how to manage the gazillion tons of information the Web throws at us each day ?

Thursday, April 2 2009

pytst 1.17

pytst 1.17 is out !

The source code can be fetched using git. If you prefer tarballs, head to the downloads page, where you'll also find Win32 binaries.

Included in this release :

  • Support for 64 bits architectures (tested under Linux only) - thanks to Thomas Brox Røst for the patch.
  • The test script (in python/test/test.py) is now self-contained, no more nasty references to a "tcc" module which was used at my company. Thanks to Paul Harrington for the prodding about this :), and for our first fork / merge test.

BTW, github is really, really cool. The only thing missing is an issue tracker. Maybe in a future release ?

Wednesday, April 1 2009

pytst now hosted on github

I've created an account on github and ported my private SVN repository to a public git repository. Now you can fork on the project as you want and eventually ask me to pull your modifications (using the "send pull request" functionality). The wiki will be a good place to document the library.

Hopefully this is the beginning of a new life cycle for pytst, I keep getting bugs/enhancement requests from time to time but I haven't enough spare time to address them.

Here is the repository home page : http://github.com/nlehuen/pytst/

- page 1 of 25