WARNING: Release 4.0 has minor binary incompatibilities with previous
releases, mainly due to the move from the interface
it.unimi.dsi.util.LongBigList to the now standard
it.unimi.dsi.fasutil.longs.LongBigList.
It is part of a parallel release
of fastutil, the DSI Utilities,
Sux4J, MG4J, WebGraph, etc. that were
all modified to fit the new interface. It comes in two versions: the standard
version and the big version, which supports >231 terms and documents. Please read our (short)
"Moving Java to Big Data" document for details.
Introduction
MG4J
(Managing Gigabytes for Java) is a free full-text search engine
for large document collections written in Java. MG4J is a highly customisable,
high-performance, full-fledged search engine providing state-of-the-art features
(such as BM25/BM25F
scoring) and new research algorithms.
The main points of MG4J are:
- Powerful indexing. Support for document collections and factories makes it possible to analyse, index and query consistently large document collections, providing easy-to-understand snippets that highlight relevant passages in the retrieved documents.
- Efficiency. We do not provide meaningless data such as "we index x GiB per second" (with which configuration? which language? which data source?)—we invite you to try it. MG4J can index without effort the TREC GOV2 collection (document factories are provided to this purpose) and scales to hundreds of millions of documents.
- Multi-index interval semantics. When you submit a query, MG4J returns, for each index, a list of intervals satisfying the query. This provides the base for several high-precision scorers and for very efficient implementation of sophisticated operators. The intervals are built in linear time using new research algorithms.
- Expressive operators. MG4J goes far beyond the bag-of-words model, providing efficient implementation of phrase queries, proximity restrictions, ordered conjunction, and combined multiple-index queries. Each operator is represented internally by an abstract object, so you can easily plug in your favourite syntax.
- Virtual fields. MG4J supports virtual fields—fields containing text for a different, virtual document; the typical example is anchor text, which must be attributed to the target document.
- Flexibility. You can build much smaller indices by dropping term positions, or even term counts. It's up to you. Several different types of codes can be chosen to balance efficiency and index size. Documents coming from a collection can be renumbered (e.g., to match a static rank or experiment with indexing techniques).
- Openness. The document collection/factory interfaces provide an easy way to present your own data representation to MG4J, making it a breeze to set up a web-based search engine accessing directly your data. Every element along the path of query resolution (parsers, document-iterator builders, query engines, etc.) can be substituted with your own versions.
- Distributed processing. Indices can be built for a collection split in several parts, and combined later. Combination of indices allows non-contiguous indices and even the same document can be split across different collections (e.g., when indexing anchor text).
- Multithreading. Indices can be queried and scored concurrently.
- Clustering. Indices can be clustered both lexically and documentally (possibly after a partitioning). The clustering system is completely open, and user-defined strategies decide how to combine documents from different sources. This architecture makes it possible, for instance, to load in RAM the part of an index that contains terms appearing more frequently in user queries.
The starting point for understanding MG4J is a look at the tutorial, which explains how to index a sample collection and query the newly constructed index from the command line or using a browser. Then, the Javadoc class documentation can provide more insights.
MG4J is free software distributed under the GNU Lesser General Public License. If you find MG4J useful, we kindly ask you to quote the following reference:
@INPROCEEDINGS{BoVTREC2005,
title = "{M}{G}4{J} at {T}{R}{E}{C} 2005",
author="Paolo Boldi and Sebastiano Vigna",
year = 2005,
booktitle = "The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings",
editor = "Ellen M. Voorhees and Lori P. Buckland",
publisher = "NIST",
series = "Special Publications",
number = "SP 500-266",
note = "\texttt{\small http://mg4j.dsi.unimi.it/}",
}
Why Java?
Writing in Java code that (essentially) has to roll bits over and over may seem a Bad Thing™. However, one should take into consideration the following points:
- Improvements in JVMs makes low-level code written in Java faster and faster; often, the performance penalty w.r.t. an equivalent C/C++ application is relatively small.
- Compression techniques can be mixed in several different ways, and an object-oriented language makes it very easy to play with different implementations of the same interface.
- Most of the time, in particular in real-world applications, you will need to rewrite all or part of the code. In this case, as a learning tool a clean implementation in an object-oriented language is certainly better than a C implementation.
- Usually, you need very fast lookups, but you can relax during index construction. Since MG4J writes completely documented bit streams, it is very easy to read its output from optimised C code.
Installation
You can grab MG4J from
Maven Central. Otherwise, you just have to install the
.jar file coming with the distribution and the dependencies, which are gathered for your convenience
in a tarball.
Citations
Here you can find (in no particular order) research papers that have been written using MG4J. The list is not exhaustive, and we will be happy to include works that are missing.
- Soumen Chakrabarti, Devshree Sane, and Ganesh Ramakrishnan. Web-scale entity-relation search architecture. In Proceedings of the 20th International Conference Companion on World-Wide Web, pages 21−22, New York, NY, USA, 2011. ACM.
- Frank Hopfgartner and Joemon Jose. Semantic user profiling techniques for personalised multimedia recommendation. Multimedia Systems, 16:255−274, 2010.
- Jeffrey Pound, Peter Mika, and Hugo Zaragoza. Ad-hoc object retrieval in the web of data. In Proceedings of the 19th international conference on World wide web, pages 771−780. ACM, 2010.
- Erik Graf, Ingo Frommholz, Mounia Lalmas, and Keith van Rijsbergen. Knowledge modeling in prior art search. In Advances in Multidisciplinary Retrieval, volume 6107 of Lecture Notes in Computer Science, pages 31−46. Springer, 2010.
- Eneko Agirre, Olatz Ansa, Xabier Arregi, Maddalen Lopez de Lacalle, Arantxa Otegi, Xabier Saralegi, and Hugo Zaragoza. Using semantic relatedness and word sense disambiguation for (CL)IR. In 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009. Multilingual Information Access Evaluation I. Text Retrieval Experiments, volume 6241 of Lecture Notes in Computer Science, pages 166−173. Springer, 2010.
- Xabier Arregi Maddalen Lopez de Lacalle Arantxa Otegi Xabier Saralegi Hugo Zaragoza Eneko Agirre, Olatz Ansa. Elhuyar-IXA: Semantic relatedness and cross-lingual passage retrieval. In 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009. Multilingual Information Access Evaluation I. Text Retrieval Experiments, volume 6241 of Lecture Notes in Computer Science, pages 273−280. Springer, 2010.
- Nuno Cardoso, Patrícia Sousa, and Mário J. Silva. Experiments with geographic evidence extracted from documents. In Evaluating Systems for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, volume 5706 of Lecture Notes in Computer Science, pages 885−893. Springer, 2009.
- Fabien Campagne. Objective and automated protocols for the evaluation of biomedical search engines using No Title Evaluation protocols. BMC bioinformatics, 9(1):132, 2008.
- Nuno Cardoso, Patrícia Sousa, and Mário J. Silva. The University of Lisbon at GeoCLEF 2008. In Working Notes for CLEF 2008, pages 17−19, 2008.
- Diana Santos, Nuno Cardoso, Paula Carvalho, Iustin Dornescu, Sven Hartrumpf, Johannes Leveling, and Yvonne Skalban. GikiP at GeoCLEF 2008: Joining GIR and QA forces for querying Wikipedia. In Evaluating Systems for Multilingual and Multimodal Information Access, volume 5706 of Lecture Notes in Computer Science, pages 894−905. Springer, 2009.
- Nuno Cardoso, David Cruz, Marcirio Chaves, and Mário Silva. Using geographic signatures as query and document scopes in geographic IR. In Advances in Multilingual and Multimodal Information Retrieval, volume 5152 of Lecture Notes in Computer Science, pages 802−810. Springer, 2008.
- Nuno Cardoso, David Cruz, Marcirio Chaves, and Mário J. Silva. The University of Lisbon at GeoCLEF 2007. In Working Notes for CLEF 2007, 2007.
- Kevin C. Dorff, Matthew J. Wood, and Fabien Campagne. Twease at TREC 2006: Breaking and fixing BM25 scoring with query expansion, a biologically inspired double mutant recovery experiment. In Proceedings of the Text Retrieval Conference, 2006.
- Lei Shi and Fabien Campagne. Building a protein name dictionary from full text: a machine learning term extraction approach. BMC bioinformatics, 6(1):88, 2005.
- Peter Mika. Distributed indexing for semantic search. In Proceedings of the 3rd International Semantic Search Workshop, pages 1−4. ACM, 2010.
- Nuno Cardoso, Mário J. Silva, and Diana Santos. Handling implicit geographic evidence for geographic IR. In Proceedings of the 17th ACM conference on Information and knowledge management, CIKM '08, pages 1383−1384. ACM, 2008.
- Roi Blanco and Alvaro Barreiro. Document identifier reassignment through dimensionality reduction. In Proc. of the 27th European Conference on Information Retrieval Research ECIR2005, number 3408 in Lecture Notes in Computer Science, pages 375−387, 2005.
- Roi Blanco and Alvaro Barreiro. Tsp and cluster-based solutions to the reassignment of document identifiers. Information Retrieval, 9(4):499−517, 2006.
- Roi Blanco and Alvaro Barreiro. A software architecture for effective document identifier reassignment. In Roberto Moreno Díaz, Franz Pichler, and Alexis Quesada Arencibia, editors, Computer Aided Systems Theory - EUROCAST 2005, 10th International Conference on Computer Aided Systems Theory, Las Palmas de Gran Canaria, Spain, February 7-11, 2005, Revised Selected Papers, volume 3643 of Lecture Notes in Computer Science, pages 254−262. Springer, 2005.
- Minsuk Lee, Weiqing Wang, and Hong Yu. Exploring supervised and unsupervised methods to detect topics in biomedical text. BMC Bioinformatics, 7(140), 2006.