I had always found Natural Language Processing (NLP) quite interesting, and had toyed a few times with the idea of exploring it a little more seriously. The buzz around recent YC alum Wit.AI finally spurred me into action, and I decided to give it a shot.
As with anything, the toughest part was getting started. NLP, despite being a discipline in its nascency, is actually quite expansive in its scope. So I decided to go for the low-hanging fruit and start with text summarization.
I came across this algorithm. It takes a naive extraction-based approach to text summarization and looked like something that should have worked well. But when I ran an implementation of it on random articles from the internet, I was somewhat disappointed with the results. That’s when I found this implementation of essentially the same algorithm, but with a slight tweak that improves the results significantly.
To explain it briefly, the algorithm converts the text into a fully connected, weighted graph in which every sentence is a node, with an edge connecting it to every other sentence. The weight of each edge is an “intersection score” which quantifies how much the two sentences it connects have in common. Finally, each sentence’s intersection scores (the weights of all its edges) are summed. The sentences with the highest totals are assumed to be “key sentences” because they have the most in common with the rest of the text, and so they are included in the summary.
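To make that concrete, here is a minimal Python sketch of the basic idea. It is not the code from either implementation mentioned above, and it does not include the tweak. The function names, the naive sentence splitter, and the exact scoring formula (shared words divided by the average number of unique words in the two sentences) are my own assumptions for illustration.

```python
import re
from itertools import combinations


def split_sentences(text):
    # Naive splitter (an assumption): break on whitespace that follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def intersection_score(a, b):
    # Assumed edge weight: number of shared words, normalised by the
    # average number of unique words in the two sentences.
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / ((len(words_a) + len(words_b)) / 2)


def summarize(text, n=2):
    sentences = split_sentences(text)
    totals = [0.0] * len(sentences)
    # Fully connected graph: visit every pair of sentences once and add
    # the edge weight to the running total of both endpoints.
    for i, j in combinations(range(len(sentences)), 2):
        weight = intersection_score(sentences[i], sentences[j])
        totals[i] += weight
        totals[j] += weight
    # Keep the n highest-scoring sentences, restored to document order.
    top = sorted(range(len(sentences)), key=lambda i: totals[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(article_text, n=3)` returns the three sentences with the largest total edge weight, in the order they appear in the text. Since every pair of sentences is compared, the work grows quadratically with the number of sentences, which is fine for a single article.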
I’ve created a small app that uses the modified implementation here, and the code for it can be found here.