I'm currently working on Information Retrieval and Extraction. If you want a general, light-hearted introduction to this field and have some time to spare, read this story I wrote. More content will come soon.
My major thesis was titled "Robust HTML to DOM conversion and applications", and my report is available in PostScript and PDF formats. An outcome of my project is the Hypertext Parsing Suite, which we have found quite useful for information extraction from HTML documents and for fine-grained analysis of those documents using the HITS and PageRank algorithms.
Here are some links to help you move ahead on the topic.