Yesterday a good friend told me he needed a good algorithm to detect keywords (relevant words) in a document. The first algorithm that came to mind was a simple word-frequency counter that discards common words using a list of stop-words built in a previous learning step. This algorithm is pretty obvious, and I’m sure it is widely used out there.
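To make the idea concrete, here is a minimal sketch of that frequency counter in Python. The stop-word list is a tiny hand-written placeholder (an assumption for illustration, not a list learned from a corpus), and the names `STOP_WORDS` and `keyword_counts` are made up for this example.

```python
import re
from collections import Counter

# Hypothetical, hand-written stop-word list. In the approach described above
# this list would come from a previous learning step over many documents.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on", "for"}


def keyword_counts(text, stop_words=STOP_WORDS, top_n=3):
    """Return the top_n most frequent words that are not stop-words."""
    # Lowercase the text and keep runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(top_n)
```

Calling `keyword_counts("The cat sat on the mat. The cat likes the mat and the cat.", top_n=2)` ranks "cat" first and "mat" second. The quality of the result depends almost entirely on how good the stop-word list is.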
Then, while Googling for some papers (I have a bunch on my laptop, but I do not recall where I stored them), I found one that opened my mind: "TextRank: Bringing Order into Texts". It suggests building a graph of words and then applying the PageRank algorithm to that graph to find the relevant words. I haven’t read it deeply yet, but I got the idea from a quick read, and it makes sense. I’m wondering why I never thought of it.
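As a rough illustration of the ranking step, here is a small PageRank by power iteration over a graph stored as a dictionary. This is only a sketch of the general PageRank idea, not the exact procedure from the paper: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the way rank from nodes without outgoing edges is spread evenly are all assumptions on my side.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Rank nodes of a directed graph given as {node: set_of_successors}."""
    # Collect every node, including those that only appear as a target.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is shared by all nodes
        # (an assumption; other treatments of dangling nodes exist).
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For a toy graph such as `{"a": {"c"}, "b": {"c"}, "c": {"a"}}`, the node "c" ends up with the highest score because two nodes point to it. With words as nodes, the highest-scoring words would be the keyword candidates.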
I’m planning to code it this week, just as a proof of concept. Basically, I will reuse some old code that I wrote (but never finished) a while ago. I remember I built it in a very modular way using classes, so adapting that code to these needs should be pretty straightforward. In the graph of words (and probably of sets of 1, 2, and 3 words), the previous word will reference the next word (if you have no idea what I mean, just take a look here).
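Since I have not written the real code yet, here is only a minimal sketch of how such a graph could be built. It assumes that a "set of n words" means a run of n consecutive words, and that the "next" set is the one starting one word later. Both are my guesses at this point, and the name `build_ngram_graph` is made up for the example.

```python
import re
from collections import defaultdict


def build_ngram_graph(text, n=1):
    """Build a directed graph where each n-gram points to the one after it."""
    words = re.findall(r"[a-z']+", text.lower())
    # Overlapping n-grams: each one starts one word after the previous one
    # (an assumption; non-overlapping n-grams would be another option).
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    graph = defaultdict(set)
    for previous, following in zip(grams, grams[1:]):
        graph[previous].add(following)  # the previous item references the next
    return dict(graph)
```

For the text "the cat sat on the mat" with `n=1`, the node "the" points to both "cat" and "mat", while "mat" has no outgoing edge because nothing follows it. With `n=2` the nodes become pairs such as "the cat" and "cat sat".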
I will post the results here.