Oct 09, 2008 15:18
So the Perl scripts I write (at work) are for a professor's data mining project. They're sort of "helper scripts" to assist her main (PhD-level) algorithm.
I made one today that successfully extracts phrases from text, based on their perceived importance :D
These phrases are 2 to n words long. For example, if I set n=5, I can successfully capture "the united states of america", assuming it appears more than once (otherwise it's totally impossible imho, unless you do something huge like the Stanford Natural Language Parser).
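Here's a rough sketch of the counting step, in Python rather than the Perl I actually used. The function names, the lowercase/whitespace tokenizing, and the "more than once" cutoff written as `c > 1` are all just assumptions for illustration, not the real script.

```python
from collections import Counter

def count_phrases(text, max_n=5):
    """Count every phrase of 2..max_n consecutive words in text."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    words = text.lower().split()
    counts = Counter()
    for n in range(2, max_n + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

def repeated_phrases(text, max_n=5):
    """Keep only the phrases that appear more than once."""
    return {p: c for p, c in count_phrases(text, max_n).items() if c > 1}
```

Calling `repeated_phrases(some_text, max_n=5)` gives back a dict of phrase to count, and that still includes all the shorter pieces of every long phrase.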
Since I'm using frequency, it may not seem too hard, but you have to remember: if "the united states of america" appears X times, then "the united", "united states", "the united states of", and so on each appear at least X times too. And you can't just use the "longest superstring" in every case because, for example, "states of america" is a substring of "states of america and", but is the latter important? Is either actually important? So you have to determine a weighting system based on their relative frequencies as well. In other words, there is filtering to be done before output.
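To give an idea of what that filtering might look like, here's one possible rule sketched in Python. It is not the weighting I actually ended up with; `filter_phrases`, `ratio`, and `min_count` are made-up names and values. The idea: throw away rare phrases, then drop any phrase that a longer phrase containing it almost fully "explains" (the longer one shows up at least `ratio` times as often).

```python
def contains(longer, shorter):
    """True if the words of `shorter` occur consecutively inside `longer`."""
    lw, sw = longer.split(), shorter.split()
    return len(lw) > len(sw) and any(
        lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1)
    )

def filter_phrases(counts, ratio=0.8, min_count=2):
    """Drop rare phrases and phrases mostly explained by a longer phrase."""
    # Assumption: anything seen fewer than min_count times is noise.
    frequent = {p: c for p, c in counts.items() if c >= min_count}
    kept = {}
    for phrase, count in frequent.items():
        # A phrase is "absorbed" when a longer phrase containing it occurs
        # nearly as often, i.e. the short one rarely appears on its own.
        absorbed = any(
            contains(other, phrase) and other_count >= ratio * count
            for other, other_count in frequent.items()
        )
        if not absorbed:
            kept[phrase] = count
    return kept
```

With this rule, "the united" disappears if it only ever shows up inside "the united states of america", but "united states" survives if it turns up a lot more often by itself. It compares every phrase against every other one, so it's only meant for small phrase lists.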
So ...yeah, patting myself on the back :P