Yesterday, while I was talking with a good friend, he told me he needed a good algorithm to detect keywords (relevant words) in a document. The first algorithm that came to mind was a simple word frequency counter that discards common words using a list of stop-words built in a previous learning step. This algorithm is pretty obvious and I’m sure it is widely used out there.
Then, Googling for some papers (I have a bunch on my laptop but I don’t recall where I stored them), I found one that opened my mind: “TextRank: Bringing Order into Texts”. It suggests building a graph of words and then applying the PageRank algorithm to that graph to find the relevant words. I haven’t read it in depth yet, but I got the idea from a quick read and it makes sense. I wonder why I never thought of it.
I’m planning to code it this week, just as a proof of concept. Basically I will reuse some old code that I wrote (but never finished) a while ago. I remember I built it in a very modular way using classes, so adapting it to these needs should be pretty straightforward. In the graph of words (and probably of sets of 1, 2 and 3 words), the previous word will reference the next word (if you have no idea what I just said, take a look here).
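To make the frequency-counter idea concrete, here is a minimal sketch in Python. It is only an illustration: the `STOP_WORDS` set is a tiny hand-picked placeholder rather than a learned list, and the `top_keywords` name and the sample sentence are made up for this example.

```python
import re
from collections import Counter

# Hypothetical, hand-picked stop-word list (an assumption for this sketch);
# a real list would be much longer or learned from a corpus.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "to", "is", "it", "on"}

def top_keywords(text, n=5):
    """Return the n most frequent words in text that are not stop-words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

sample = "The disk array stores data on each disk, and the data is striped."
print(top_keywords(sample, 2))
```

With this sample, "disk" and "data" each appear twice and every other word once, so those two come out on top. The obvious weakness is that raw counts ignore how words relate to each other, which is what a graph-based approach tries to capture.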
I will post the results here.
crodas 7:35 pm on October 7, 2009 Permalink
Well, I decided to test this out at lunch time. I Googled for a simple PageRank implementation in Python (because I haven’t finished my own implementation in PHP yet) and found one that I had used some time ago (http://www.eioba.com/a69792/the_google_pagerank_algorithm_in_126_lines_of_python ).
My code itself was very simple: it just splits the text into words and treats every word as a web page that links to the next word and to the previous word.
Then I ran my simple program against http://en.wikipedia.org/wiki/RAID and it returned this set of words, in this order:
raid, disk, data, parity, disks
Of course, a bunch of useless words (the, an, in) were among the results, but I removed them, and doing that automatically is pretty straightforward.
I will keep playing with it tonight to find two-word and n-word *keyphrases*.
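As a rough sketch of the approach described in this comment (not the original program, and not the linked 126-line implementation), the following Python code links every word to its previous and next word and runs a plain iterative PageRank over that graph. The function names, the damping factor of 0.85, the 50 iterations and the small `STOP_WORDS` set are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, used only to filter the final ranking.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "to", "is", "it", "on"}

def build_word_graph(text):
    """Link each word to the word before it and the word after it."""
    words = re.findall(r"[a-z]+", text.lower())
    graph = defaultdict(set)
    for prev, nxt in zip(words, words[1:]):
        if prev != nxt:  # skip self-links from repeated words
            graph[prev].add(nxt)
            graph[nxt].add(prev)
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank; links are symmetric, so no node is dangling."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour shares its rank equally among its own links.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank

def keywords(text, n=5):
    """Rank all words, then drop stop-words from the result."""
    rank = pagerank(build_word_graph(text))
    ordered = sorted(rank, key=rank.get, reverse=True)
    return [w for w in ordered if w not in STOP_WORDS][:n]

print(keywords("raid disk raid data raid parity", 3))
```

In this toy input "raid" is adjacent to every other word, so it collects the most rank. Stop-words are removed only after ranking here, mirroring the manual clean-up mentioned above; filtering them before building the graph would be another option and would change which words end up adjacent.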
David Hofmann 9:53 am on October 8, 2009 Permalink
Hmm, Cesar, you are doing very interesting stuff. I think there are very few people who care about this kind of stuff. Keep on working, Cesar! I can see this working for blog posts, to quickly show the relevant information at the top of the page, as people normally fail to write a good introduction to what they are going to talk about.
David Hofmann 9:59 am on October 8, 2009 Permalink
Like me
crodas 10:13 am on October 8, 2009 Permalink
Well, I have a similar idea: I will probably create a web service that suggests keywords to improve SEO for bloggers.
Suggesting synonyms that have better SEO could also be a valid application for this cool algorithm.
David Hofmann 4:47 pm on October 8, 2009 Permalink
Now make a dzone clone and arrange the articles with k-means + TextRank
Gilberto Ramos 11:31 am on October 16, 2009 Permalink
I wish I had enough time to research and code something too!
I just want to finish university! Then I’ll have a normal geek life!
propsimmige 10:45 am on November 2, 2009 Permalink
Another variant is also possible.
crodas 10:43 am on November 9, 2009 Permalink
Such as?