Wednesday, November 09, 2005

Experiments with Text Similarity.

If you do read the intercodes blog then you might have had some vauge hint of me having a 'censored' project, but thats not really that censored.. and I have discussed about that on this blog before. However, seeing some competition right before my eyes, sort-a gave me a rude awakening.(btw, I also registered there to check it out).

And one of the things that is required for that project (apart from a realistic deadline, and and some kick-ass commitment) is a good automated text-similarity algorithm. not the Text::Similarity types (you would know that if you had something to do with perl), but something on a more fundamental level. Approaches which basically match documents on semantic content. So, I basically experimented with some basic algorithms and here are some results. So whatever I've come up with ain't so promising .. but thats just very basic single nights hacking. If anyone else is interested they can of course use the code. its on the MIT Licencse (i.e.. take the code, and don't blame me if loose your hair and your wife runs away.) Since these are just experiments.. we won't care too much about efficiency of code, but rather on getting there.

So each of the textual matter in question is basically stored in flat out text files with no extra formatting around. Just the post content. so we need a quick and dirty function to read stuff from a file and convert it into a list of non-repeating words. Thats easy enough, if we define a small helper function to split a string on space. (for all those who know the Common Lisp Ansi Spec on the top of your heads, is there a better way to do this?)

so here is the splitter function.


(defun split-on-space (input-string)
"splits the input string on space into a list of strings"

(let ((old-pos 0) (inspace nil) (ret ()))
(loop for char across input-string
for position from 0
if (or (not (graphic-char-p char)) (char= #\Space char))
do (unless inspace
(setq inspace t)
(if (eql old-pos position) (values)
(push (subseq input-string old-pos position) ret)))
else
do (when inspace
(setq inspace nil)
(if (eql old-pos position) (values)
(setq old-pos position)))
finally (return
(progn (unless inspace
(push (subseq input-string old-pos) ret))
(nreverse ret))))))


well, now that we can successfuly split strings into their component words, assuming they are seperated by spaces. (usually they are not, but this will have to do for now), we can write our function which swallows a file whole and gives us a list of words occuring in the file.


(defun read-from-file (filename &key (remove-small t))
"reads a text file and splits it into a list of words"

(with-open-file (file filename :direction :input)
(do ((line (read-line file nil) (read-line file nil))
(words ()))
((null line) (nreverse words))
(dolist (word (split-on-space line))
(if (and remove-small (> (length word) 3))
(push word words))))))


Since we are assuming its primarily textual data (kadhai in local-speak), its safe to assume the data is line oriented. (note-to-self : a better way would be to read that into a string buffer and then split-on-space the stringbuffer). Also, there is this small key value remove-small which by default is set to t. remove-small is a small flag which decides if words of length lesser than 3 are included or not. the default value of 't' is not such a wise idea, as I learnt later. The reason being most geek-speak and abbreviations are 3 lettered (eg: C++, AST, GC, SCM, DOS, WIN,NIX). So, reading this way would perform badly on textual content with a degree of code in them (like this one).

Since its a really good idea to generate a frequency table and we can easily combine that with the process of removing repeated words. So to get the job done quick and dirty, I am using an a-list here, but for larger sets a hash table would give better peformance.


(defun generate-freqtab (object-list equals-function)
"generate-freqtab object-list equals-function, uses the equals-function for
comparison and returns an a-list of objects and their relative frequencies"


(let ((frequency-table ()))
(dolist (item object-list frequency-table)
(if (null (assoc item frequency-table :test equals-function))
(push (cons item 1) frequency-table)
(incf (cdr (assoc item frequency-table :test equals-function)))))))


The above function is defined to be a bit more generic than it needs to. One simplification could be made on the equals-function defaulting to #'string=. Now that we have our alist based frequency table.. we can check out a whole bunch of measures, now we will look at two really disappointing ones, but not totally. They gave me some quick insight into the nature of random textual matter as opposed to more sophisticated and rather correct techniques.

So, we will first look at the common word count, but its also usually helpful if we know what are the common words.


(defun n-similarity (ftab1 ftab2 equals-function)
"counts the number of datum that are present in both the frequency tables
frequency values are ingored for the moment."


(let ((n 0) (common-words ()))
(dolist (word ftab1 (values common-words n))
(if (assoc (car word) ftab2 :test equals-function)
(progn
(incf n)
(push (car word) common-words))))))



This isn't particularly useful for identifying positives, as it gets in lots of false positives through, and further more the usefullness of this is limited by general english words such as which, where, therefore which occour much more commonly than what would be considered as the words containing actual content. (I would consider these words as semantic indicators, rather than contentful words). So the next obvious experiment is to look at the common words and their frequencies, so if there is some difference that should iron out. (like there is high probability that words like 'which' will be used approximately the same number of times, so taking a difference of frequency should fix that). So, we will write a small function, basically an imporvement over the above one.


(defun nf-similarity (ftab1 ftab2 equals-function)
"counts the number and frequency of datam that are present in both the tables
and builds small list of list out of it."


(let ((nf-list ()))
(dolist (word ftab1 nf-list)
(let ((second-word (assoc (car word) ftab2 :test equals-function)))
(if second-word
(push (list (car word) (cdr word) (cdr second-word)) nf-list))))))



Hmm.. but the results aren't too suprising.. It seems, even though posts of similar topics do have these words in common, they aren't used similarly enough to actually use 'difference' of frequencies viabily. but all is not lost, atleast there are a couple of important lessons to learn about the whole text similarity thing.

  • You can't ignore small words because, they might be important after all.

  • You can't use word occurances blindly because english has more snytactic crap than Java does!!

  • And, you can't uses frequencies blindly because, english is a very quirky language that allows even semantically similar topics to be written in a weirdly different ways.

  • This needs a much more fancy research schedule!



And if any one's reading this blog, and have some suggestions or things I can look into (preferably with solid data to back stuff up..) please do mail me.

Signing Off,
Vishnu Vyas.

No comments: