Date ranges in Lucene

Posted by Mike Haller on Saturday, April 4. 2009 at 13:23 in Java
Lucene is a fast and efficient Java search engine library. Once indexed, any object can be found by matching on its attributes. An object in Lucene is called a Document and its attributes are called Fields. Lucene uses Query and Filter objects to narrow a search down to what the user wants to find. A Query selects the set of candidate documents, while a Filter gives more fine-grained control over what the search result may include. Queries are more memory-hungry than filters, but for the usual use cases both perform well.

In this post, I'd like to show how you can search for Date and Time ranges, for example modification dates of files. I'd like to find all files on my local file system which were modified in April 2009.

To do that, I first create an index containing all my files. Each file is represented by a Lucene Document that contains only the full filename and the modification date.
final String INDEX_DIR = "c:/temp/index/";
final boolean CREATE_INDEX = true;
final IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(),
                                    CREATE_INDEX, IndexWriter.MaxFieldLength.LIMITED);
final FSVisitor fsVisitor = ... // Visits every folder and file recursively 
for (final File root : File.listRoots()) {
    fsVisitor.visit(root); // calls onFile(), see below
}
writer.optimize();
writer.close();
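
The FSVisitor itself is not shown in this post. As a rough idea of what such a helper might look like, here is a minimal sketch using only the JDK. The class name SimpleFsVisitor and the collect() helper are made up for illustration, and the sketch does not guard against symbolic link cycles.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the post's FSVisitor; not part of Lucene or the JDK.
abstract class SimpleFsVisitor {

    /** Callback invoked for every existing file and folder that is visited. */
    abstract void onFile(File file);

    /** Visits the start entry, then descends depth-first into folders. */
    void visit(final File start) {
        if (!start.exists()) {
            return;
        }
        onFile(start);
        // listFiles() returns null for plain files and for unreadable folders
        final File[] children = start.listFiles();
        if (children != null) {
            for (final File child : children) {
                visit(child);
            }
        }
    }

    /** Convenience helper: collects the start entry and everything below it. */
    static List<File> collect(final File start) {
        final List<File> seen = new ArrayList<File>();
        new SimpleFsVisitor() {
            @Override
            void onFile(final File file) {
                seen.add(file);
            }
        }.visit(start);
        return seen;
    }
}
```

In the indexing code, the onFile() callback would be the place where the Lucene Document is built and handed to the writer.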

The code within the visitor creates the Lucene Document and adds two fields, path and modified, to it. Neither field is analyzed, which means the text value is kept as-is and is not normalized or broken up into multiple words. Analyzers exist for many languages and are used for full-text search; we don't need them for this demo. Field.Store.YES tells Lucene to store the actual value of both the filename and the modification time in the index, so we can display them later. Usually, you want to add one field that uniquely identifies each document, such as an absolute filename, or a primary key if you are indexing business entities.
public void onFile(final File file) {
   final Document doc = new Document();
   doc.add(new Field("path",
        file.getPath(),
        Field.Store.YES,
        Field.Index.NOT_ANALYZED));
   doc.add(new Field("modified",
        DateTools.timeToString(file.lastModified(), DateTools.Resolution.MINUTE),
        Field.Store.YES,
        Field.Index.NOT_ANALYZED));
   writer.addDocument(doc);
}


After running this on my local hard disk ("C:/" drive only), I had a 50 MB index which contained every file. Now I can go on and write a little test case for searching files which were modified in April. First, I'll create two Date objects containing the first and the last date I wish to include in the search:
final DateFormat dateFormat = DateFormat.getDateInstance(DateFormat.DEFAULT, Locale.GERMAN);
final Date from = dateFormat.parse("01.04.2009");
// April has only 30 days; a lenient DateFormat would roll "31.04.2009" over to 1 May
final Date to = dateFormat.parse("30.04.2009");
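
A word of caution on the parsing step: DateFormat is lenient by default, so an out-of-range day such as 31 April is not rejected but silently rolled over into the next month. The following small sketch shows the difference. It uses an explicit dd.MM.yyyy pattern instead of the locale-based instance (my assumption, to keep it independent of locale data), and the class and method names are only for illustration.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

final class LenientDateDemo {

    /** Parses a dd.MM.yyyy string; returns null if the text is rejected. */
    static Date parse(final String text, final boolean lenient) {
        final DateFormat format = new SimpleDateFormat("dd.MM.yyyy");
        format.setLenient(lenient);
        try {
            return format.parse(text);
        } catch (ParseException e) {
            return null;
        }
    }

    /** Month of the given date in the default time zone, 1 = January. */
    static int monthOf(final Date date) {
        final Calendar cal = new GregorianCalendar();
        cal.setTime(date);
        return cal.get(Calendar.MONTH) + 1;
    }

    public static void main(final String[] args) {
        // Lenient (the default): 31 April rolls over to 1 May, so this prints 5
        System.out.println(monthOf(parse("31.04.2009", true)));
        // Strict: 31 April is rejected, so this prints null
        System.out.println(parse("31.04.2009", false));
        // 30 April is the real last day of the month, so this prints 4
        System.out.println(monthOf(parse("30.04.2009", false)));
    }
}
```

Calling setLenient(false) on the format makes such typos fail fast instead of quietly widening the range.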

Then, I translate these dates into a form which can be processed by Lucene:
final String sFrom = DateTools.dateToString(from, DateTools.Resolution.DAY);
final String sTo = DateTools.dateToString(to, DateTools.Resolution.DAY);

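The sFrom and sTo variables now contain values in the form yyyyMMdd, for example 20090401. Note that DateTools converts to GMT, so depending on your local time zone the day in the string can differ by one from the local date. That's the starting point for creating the Lucene Query and Filter.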
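
Because DateTools works in GMT, a date that means "midnight, local time" can end up on the previous day in the generated string. The sketch below imitates the DAY resolution with plain JDK classes to make the effect visible. It is not Lucene's implementation, and the class and method names are invented for this example.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

final class GmtDayStringDemo {

    /** Formats a date as yyyyMMdd in the given time zone. */
    static String dayString(final Date date, final TimeZone zone) {
        final SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd", Locale.ROOT);
        format.setTimeZone(zone);
        return format.format(date);
    }

    /** Midnight at the start of the given day in the given time zone. */
    static Date midnight(final int year, final int month, final int day, final TimeZone zone) {
        final Calendar cal = new GregorianCalendar(zone);
        cal.clear();                    // reset all fields, including the time of day
        cal.set(year, month - 1, day);  // Calendar months are zero-based
        return cal.getTime();
    }

    public static void main(final String[] args) {
        final TimeZone berlin = TimeZone.getTimeZone("Europe/Berlin");
        final Date from = midnight(2009, 4, 1, berlin);
        // Local view of the date: prints 20090401
        System.out.println(dayString(from, berlin));
        // GMT view: Berlin is two hours ahead in April, so this prints 20090331
        System.out.println(dayString(from, TimeZone.getTimeZone("GMT")));
    }
}
```

If the exact day boundaries matter, one option is to build the from and to dates in GMT in the first place.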

The following code defines the set of documents to search (all of them) and the filter criteria to apply to the search result:
Query query = new MatchAllDocsQuery();
Term lowerTerm = new Term("modified", sFrom);
Term upperTerm = new Term("modified", sTo);
RangeFilter filter = new RangeFilter("modified", lowerTerm.text(), upperTerm.text(), true, true);


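The lowerTerm and upperTerm variables contain Lucene Terms. A Term is simply a key/value pair: the key is a field name and the value is the String to search for. The two boolean true values tell Lucene to include the given bounds in the range, as opposed to making the range exclusive. Take care whether you are using a resolution of minutes or days, and set the flags accordingly. The range is compared as plain strings, so a value indexed at minute resolution such as 200904301200 sorts after a day-resolution upper bound of 20090430 and is not matched, even with an inclusive upper bound.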
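
To see why mixing resolutions matters, it helps to look at the range check as a plain string comparison. The following sketch is not Lucene code, just String.compareTo() with the same four parameters as the filter; the class and method names are made up for illustration.

```java
final class StringRangeDemo {

    /** Plain lexicographic range check on strings. */
    static boolean inRange(final String value, final String lower, final String upper,
                           final boolean includeLower, final boolean includeUpper) {
        final int lo = value.compareTo(lower);
        final int hi = value.compareTo(upper);
        final boolean aboveLower = includeLower ? lo >= 0 : lo > 0;
        final boolean belowUpper = includeUpper ? hi <= 0 : hi < 0;
        return aboveLower && belowUpper;
    }

    public static void main(final String[] args) {
        final String noonOnLastDay = "200904301200"; // minute resolution, 30 April 12:00

        // Day-resolution upper bound, inclusive: the last day is missed (prints false)
        System.out.println(inRange(noonOnLastDay, "20090401", "20090430", true, true));

        // Minute-resolution upper bound, inclusive: matched (prints true)
        System.out.println(inRange(noonOnLastDay, "200904010000", "200904302359", true, true));

        // First day of the next month as an exclusive upper bound: matched (prints true)
        System.out.println(inRange(noonOnLastDay, "20090401", "20090501", true, false));
    }
}
```

Either of the last two variants covers the whole month; the exclusive upper bound has the advantage that it works no matter which resolution was used at indexing time.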

We're ready to start the search and let Lucene do all the hard work:
final IndexSearcher searcher = new IndexSearcher(INDEX_DIR);
final TopDocs search = searcher.search(query, filter, 15);

That tells Lucene to return at most 15 documents; you can of course increase the number. Now, let's print all the search results to the console:
final ScoreDoc[] scoreDocs = search.scoreDocs;
for (final ScoreDoc scoreDoc : scoreDocs) {
    final Document doc = searcher.doc(scoreDoc.doc);
    System.out.println(String.format("%s %s", doc.get("modified"), doc.get("path")));
}

If you run this code, you will see output like the following:
200904021935 C:\
200904021726 C:\Boot\BCD
200904021726 C:\Boot\BCD.LOG
200904020636 C:\Program Files\Common Files\Microsoft Shared\vgx
200904020636 C:\Program Files\Internet Explorer
200904020636 C:\Program Files\Internet Explorer\de-DE
200904020636 C:\Program Files\Internet Explorer\en-US
200904012053 C:\Program Files (x86)\Adobe\Reader 8.0\Reader
200904020636 C:\Program Files (x86)\Common Files\microsoft shared\vgx
200904021722 C:\Program Files (x86)\Common Files\Steam
200904041022 C:\Program Files (x86)\Common Files\Symantec Shared\CCPD-LC\symlcrst.dll
200904020636 C:\Program Files (x86)\Internet Explorer
200904020636 C:\Program Files (x86)\Internet Explorer\de-DE
200904020636 C:\Program Files (x86)\Internet Explorer\en-US
200904020636 C:\Program Files (x86)\Internet Explorer\SIGNUP


Happy indexing!


About

My name is Mike Haller and I'm a software developer and architect at Bosch Software Innovations in Germany. I love programming, playing games and reading books. I like good food, taking photos, and learning about and mentoring in the craftsmanship of commercial software development.
