Latent semantic indexing (LSI) provides the ability to correlate semantically related terms that could be latent in the set of documents. Latent semantic analysis allows us to see semantics in the set of documents and how we can extract meaning of the text in the documents. This is used to get a response to queries that return more "meaningful" results and not just a keyword search.
Excerpt from Wikipedia: http://en.wikipedia.org/wiki/Latent_semantic_indexing
"A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts..... The method, also called latent semantic analysis (LSA), uncovers the underlying latent semantic structure in the usage of words in a body of text and how it can be used to extract the meaning of the text in response to user queries, commonly referred to as concept searches. Queries, or concept searches, against a set of documents that have undergone LSI will return results that are conceptually similar in meaning to the search criteria even if the results don’t share a specific word or words with the search criteria”
ABOUT THE CODE: This is a very old code-base that was created in 2008. And publishing this now in Aug 2012 (on public demand). It was created while preparing for my PhD admission - But me and my prospective guide(s) got terribly pissed off at each other - I had first-hand experience of pathetic academic elitism- anyways it's another sad story. Personally, I do not use .NET framework anymore, so will not be making changes to this C# code, unless I have a good reason to do so. This code was also written in a hurry. This is not a very well organized code - perhaps not very efficient either. I encourage people to use this only for reference and implement efficient code of your own. If you do, and willing to share it, please send me the link to your page (or the code) and I’ll update my page to include your link/code. If you like this post and it was helpful to you, please do rate it and leave a comment below.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
This has been done before by Dr. E. Garcia and in explained in good enough detail - I won't be doing justice to this topic if I explained it again.
This tutorial is in 5 parts, but if you know what LSI is - jump over to it's 4th part. Some of the variable names in the code also come from these tutorials - so it would be easy to map to the code.
In here, I’ll be explaining the steps followed in the source-code in two parts.
Part 1 - What the code does. You are currently reading it.
Below is the demo implementation in C# .NET.
Examples from a sample collection of documents
Above: Simple search query w.r.t. contents in the sample. Show the matching documents. Rank approximation: 35% of collection
Above: The same query, but see the document in highlighted result. The document mentions word "fighting" which is synonymous to "clash". (Of course, the collection must have enough content to be able to establish synonymy). Rank approximation: 35% of collection
Above: Another random query that just occurred to me. The word "people" is present in the documents. The word "convince" does not occur in the highlighted document. But, interestingly, it ranks documents related to "social engineering". Rank approximation: 35% of collection
Pivoted Document Length Normalization (and cosine normalization): http://ir.iit.edu/~dagr/cs529/files/handouts/singhal96pivoted.pdf
One more: Missing - I used one more web page as a reference - which explained cosine normalization and the creation of weighted term matrix in simpler language and helped me speed up this work. Unfortunately, I am unable to find the reference to that (after all it's 4 years since I last worked on this)