At PARC tonight, Norvig gave a quick and interesting presentation on some of the AI concepts being employed at Google. My only takeaway is that Google is powered by the "law of large numbers".
During the talk, a paper by Banko and Brill, "Scaling to very very large corpora for natural language disambiguation", was cited.
What was presented was a graph showing how the effectiveness of machine learning algorithms changes as the training corpus gets larger and larger. The larger the training corpus, the greater the accuracy. So, instead of finding a smarter algorithm, just use a larger corpus. That is, a weaker algorithm can outperform a stronger one if the weaker one is trained on an order of magnitude more data. I guess when you can't be clever, be bigger.
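To get a feel for the shape of that kind of graph, here is a toy sketch of a learning curve. It is not the Banko and Brill experiment: the data is synthetic, the "algorithm" is a deliberately dumb memorizer that remembers the most common label for each context, and every name and constant (`VOCAB_SIZE`, `SIGNAL`, `learning_curve`) is made up for illustration. It only shows one simple learner improving as it sees more data, not a comparison between a weak and a strong algorithm.

```python
import random
from collections import Counter, defaultdict

VOCAB_SIZE = 1000   # hypothetical number of distinct contexts
SIGNAL = 0.9        # hypothetical chance a context shows its "true" label


def make_examples(n, true_labels, rng):
    """Draw n (context, label) pairs; a label is flipped with prob 1 - SIGNAL."""
    examples = []
    for _ in range(n):
        context = rng.randrange(VOCAB_SIZE)
        label = true_labels[context]
        if rng.random() > SIGNAL:
            label = 1 - label
        examples.append((context, label))
    return examples


def train(examples):
    """Memorize the most frequent label seen for each context."""
    counts = defaultdict(Counter)
    for context, label in examples:
        counts[context][label] += 1
    return {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}


def accuracy(model, examples):
    """Fraction of correct guesses; unseen contexts fall back to label 0."""
    hits = sum(1 for c, y in examples if model.get(c, 0) == y)
    return hits / len(examples)


def learning_curve(sizes, seed=0):
    """Train on growing samples and score each model on one fixed test set."""
    rng = random.Random(seed)
    true_labels = [rng.randrange(2) for _ in range(VOCAB_SIZE)]
    test_set = make_examples(5000, true_labels, rng)
    curve = []
    for n in sizes:
        model = train(make_examples(n, true_labels, rng))
        curve.append((n, accuracy(model, test_set)))
    return curve


if __name__ == "__main__":
    for n, acc in learning_curve([100, 1000, 10000, 100000]):
        print(f"{n:>7} training examples -> accuracy {acc:.3f}")
```

With only a hundred examples most contexts have never been seen, so the memorizer is mostly guessing. With a hundred thousand, it has seen every context many times and its accuracy approaches the ceiling set by the label noise.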
Obviously the point here is that with the ability to apply machine learning algorithms over 1.5k TB of data in a few minutes, you are eventually going to find something useful. See the labs for a few cool examples.
Of course the "law of very large numbers" might be more applicable.
As an aside, this is why I use OTR over Jabber (via Adium).