How does <100ms, variable-length summarization of a 200-page book sound? What if it only used open source tools, and was exposed via an API that you’re already used to if you’re running Solr? How about we throw in the fact that all of the heavy lifting is done by established components, so the implementation is basically a bit of glue code?
As a teaser, here’s its output on The Art of War — it isn’t perfect, but it gives you an idea of the content, and took about 60ms on my desktop:
By method and discipline are to be understood the marshaling of the army in its proper subdivisions, the graduations of rank among the officers, the maintenance of roads by which supplies may reach the army, and the control of military expenditure. 11. … Now the general who wins a battle makes many calculations in his temple ere the battle is fought. The general who loses a battle makes but few calculations beforehand. Thus do many calculations lead to victory, and few calculations to defeat: how much more no calculation at all! … Hence a wise general makes a point of foraging on the enemy. … Therefore in chariot fighting, when ten or more chariots have been taken, those should be rewarded who took the first. … Hence to fight and conquer in all your battles is not supreme excellence; supreme excellence consists in breaking the enemy’s resistance without fighting. 3. … Therefore the skillful leader subdues the enemy’s troops without any fighting;
The Key Pieces
Apache Solr is awesome. Its most common use is to enable search, but it can be used for a few related tasks too. One quick example: I’m using it to do the document processing steps for my thesis work, because you can easily configure it via XML, it works quickly, and the contents of the index can be inspected without too much trouble.
For this summarization tool, I’m combining two features: MoreLikeThis and the PostingsHighlighter. MoreLikeThis does what you might think it does—finds similar documents. PostingsHighlighter is used to find the best piece of text within a document for a given query.
A common way to find similar documents is to use the words present in a document to represent it in high-dimensional space. Each coordinate of that space corresponds to one word, and the value is usually weighted by some formula that combines the importance of the word to the document and the importance of the word in the context of your whole corpus. The most common weighting scheme is Term Frequency – Inverse Document Frequency (TF-IDF). Roughly, a word gets a high weight when it appears a lot in this document but not in much of the rest of the corpus. If this isn’t clear, post in the comments and I’ll make another post on it.
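To make the weighting concrete, here’s a toy version in Python. This is the textbook formula (raw term frequency times log inverse document frequency), not Lucene’s exact implementation, and the corpus is made up for illustration:

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Classic TF-IDF: term frequency in this doc, times log-scaled
    inverse document frequency across the corpus."""
    tf = Counter(doc)[term]
    df = sum(1 for d in corpus if term in d)          # docs containing the term
    idf = math.log(len(corpus) / df) if df else 0.0   # rare terms get a bigger boost
    return tf * idf

# A tiny corpus of pre-tokenized "documents"
corpus = [
    ["the", "art", "of", "war"],
    ["the", "general", "wins", "the", "battle"],
    ["the", "battle", "of", "the", "books"],
]
doc = corpus[1]
print(tf_idf("general", doc, corpus))  # appears in only one doc: high weight
print(tf_idf("the", doc, corpus))      # appears everywhere: idf = log(1) = 0
```

A word like “the” scores zero even though it’s frequent in the document, which is exactly the behavior we want from the weighting.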
Given that Solr uses an inverted index to perform searches, using that type of similarity directly would take a really long time: you would need to search for every term that’s in the query document. To get around that, MoreLikeThis essentially approximates the “ideal” query. It first finds the terms with the highest TF-IDF weights in the current document, then constructs a query using a limited set of those. Since it only has to search for maybe 10-50 terms, the time taken to run the query is far less than if it had to search for all of the potentially thousands of unique terms in a document.
Highlighting is an important piece of search — a lot of the time the user’s information need can be met by a good snippet from the page. A clever way to figure out which snippets to return was added in Solr 4.2, and there’s a nice description of it on Mike McCandless’ blog. The basic idea is to break the document into sentences, then treat each sentence as a separate document; the sentences with the highest scores get returned. My favorite part of this is that it can use the same tools that are used for the core search functionality.
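In spirit, passage scoring looks something like the sketch below. The real PostingsHighlighter scores passages with a BM25-style formula against the index; this toy version just counts query-term hits per sentence, which is enough to show the sentences-as-mini-documents idea:

```python
import re

def best_sentences(text, query_terms, n=2):
    """Split text into sentences, score each by how many query terms
    it contains, and return the top n -- a crude stand-in for the
    highlighter's passage scoring."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    def score(sentence):
        words = set(re.findall(r"\w+", sentence.lower()))
        return sum(1 for t in query_terms if t in words)
    return sorted(sentences, key=score, reverse=True)[:n]

summary = best_sentences(
    "The general wins. The weather was mild. "
    "A wise general calculates before battle.",
    {"general", "battle", "calculations"},
)
print(summary)
```

Feed it the query from the previous step and the top-scoring sentences come back in order, which is essentially the summary.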
Putting it together
If you peek at the code or JavaDocs for MoreLikeThis within Lucene, you see that a MoreLikeThis object can create a Query. Normally, that query gets applied to the other documents in your collection, but that’s not what we want here. We just hand the query that comes from the MoreLikeThis.like method to the highlighter, and get back a nice String containing our summary.
What is this doing? It finds terms that should be representative of the document, then finds sentences that contain a lot of those terms.
My implementation has a SearchComponent that handles this, so it can easily be added to any other search. The parameters are mostly the ones from MoreLikeThis. I simply added one that turns the component on, and another that determines the number of sentences to return.
In order to make it easier to see how things were working, I modified the very simple Handlebars.js-based page from SearchNuggest to display results.
The data came from a convenience sample of Project Gutenberg. (Pro tip: “convenience sample” is often a euphemism for lazy research. In my case, Windows Update shut down my computer while stuff was downloading, and I didn’t want to figure out where to resume.) I have a little over 5,000 documents loaded into a simple Solr index, and didn’t bother with linking it to the structured metadata. I left in the boilerplate, so there is a lot of stuff that we wouldn’t want showing up in a summary.
Here’s the 5-minute frontend displaying the results from Frankenstein:
One more example, this time a shorter summary of Sherlock Holmes:
The words that show up in bold are the ones that MoreLikeThis picked out as important. Clearly, we have some room for improvement, but this seems like a decent starting point.
What can be improved?
It seems like Lucene’s implementation of TF-IDF doesn’t downweight common terms enough for this purpose. Look at the last sentence of the Sherlock Holmes summary: the words “said” and “much” are showing up as important words. One way to handle this is to change the IDF calculation to be more like BM-25’s. To see why we might want to do that, let’s take a look at this plot from Wolfram Alpha:
This shows the IDF for a corpus that has 5000 documents as the number of documents containing a word goes from 1 to 5000. The blue line is the IDF used in Solr’s TF-IDF, and the red line is BM-25 according to Wikipedia. The difference is pretty striking: BM-25 will give negative weights to terms that appear in more than half of the documents, while the difference in weights using TF-IDF when DF goes from 50% to 100% is minimal.
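The two curves are easy to reproduce numerically. Below, `idf_classic` is the form I believe Lucene’s classic similarity uses (1 + ln(N/(df+1))) and `idf_bm25` is the BM-25 IDF as given on Wikipedia, both over the 5,000-document corpus from the plot:

```python
import math

N = 5000  # corpus size from the plot above

def idf_classic(df):
    """Lucene classic-style IDF: 1 + ln(N / (df + 1)); stays positive."""
    return 1 + math.log(N / (df + 1))

def idf_bm25(df):
    """BM-25 IDF per Wikipedia: ln((N - df + 0.5) / (df + 0.5));
    crosses zero at df = N/2 and goes negative beyond it."""
    return math.log((N - df + 0.5) / (df + 0.5))

for df in (1, 2500, 4999):
    print(df, round(idf_classic(df), 3), round(idf_bm25(df), 3))
```

Running this shows the striking difference: at df = 2500 (half the corpus) the BM-25 IDF is already zero, and for words like “said” that appear in nearly every book it is strongly negative, while the classic IDF barely moves between 50% and 100% document frequency.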
Another thing that’s a bit bigger of a task would be result diversification. If we get some of the keywords in one sentence, it might be more useful to get the other words than a lot of similar sentences. I don’t really have any concrete thoughts on how to make that happen efficiently in Solr.
Some Comments on Usefulness
When I started thinking about document summarization, I thought that solving the problem would be of huge practical significance. This is nowhere close to solving the problem, and I think it’s pretty far from the state of the art in the NLP literature, but I’m also reconsidering that original thought.
In almost every case I can think of, if a document is long enough to be worth summarizing, someone has already written a summary. If it is a journal article, the abstract is usually a good summary. If it’s a book, the introduction and/or conclusion will give you the basic idea. With news stories, the headline is a short summary, and the first paragraph is often a summary. A machine-generated extractive summary will almost always be worse than a human-generated one, especially if the same author writes both.
One case where it can be useful is when you have unstructured documents with a lot of boilerplate, but want to enable browsing and searching by non-free-text fields. The Project Gutenberg files are a good example of that: they all start and end with some type of boilerplate, but it isn’t uniform across all files. If this were a real requirement and not a blog post, I could probably write a regular expression to remove it for indexing purposes, and then the beginning of the document might be a reasonable “summary.” If the boilerplate has a lot of different versions, that would get tricky, and something like this summarization approach might be useful.
The summaries shown here are extracted from public domain works, and the digitization was done as part of the very cool Project Gutenberg. I’m not sure I completely understand the license, so I’m including their snippet and link, which it seems like I’m supposed to do if these quotes count as derivative works:
This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org.
Getting the Code
If you’re interested in using this, let me know and I’ll see if I can get permission from my employer to make it publicly available. If you have any questions, suggestions, or other comments, let me know below!