Friday, January 04, 2008

New version of Terrier (Terabyte Retriever)

Today (4 January 2008) the Information Retrieval Group at the University of Glasgow (where the great Keith van Rijsbergen does his research) announced the release of version 2 of Terrier. This software, written in Java, is a probabilistic information retrieval engine that implements a model known as DFR (Divergence From Randomness).

Its features are:

General

  • Indexing support for common desktop file formats, and for commonly used TREC research collections (e.g. TREC CDs 1-5, WT2G, WT10G, GOV, GOV2, Blogs06).
  • Many document weighting models, including several parameter-free Divergence from Randomness weighting models, Okapi BM25 and language modelling.
  • Conventional query language supported, including phrases, and terms occurring in tags.
  • Full-text indexing of large-scale document collections, scaling to at least 25 million documents in a centralised architecture.
  • Modular and open indexing and querying APIs, to allow easy extension for your own applications and research.
  • Active Information Retrieval research fed into the Open Source platform.
  • Open Source (Mozilla Public Licence).
  • Written in cross-platform Java - works on Windows, Mac OS X, Linux and Unix.
  • Large user-base over 3 years of public release.
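
To make the weighting models mentioned above more concrete, here is a minimal sketch of Okapi BM25 scoring over a tiny in-memory collection. It is not Terrier's code or API: the function name `bm25_scores`, the defaults `k1=1.2` and `b=0.75`, and the non-negative idf variant are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query_terms`.

    Hypothetical helper for illustration; not part of Terrier.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Non-negative idf variant (assumption; other variants exist).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["terrier", "retrieval"],
    ["java", "search"],
    ["terrier", "terrier", "java"],
]
scores = bm25_scores(["terrier"], docs)
```

Documents that do not contain the query term score zero, while repeated occurrences raise the score with diminishing returns.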

Indexing

  • Out-of-the-box indexing of tagged document collections, such as the TREC test collections.
  • Out-of-the-box indexing for documents of various formats, such as HTML, PDF, or Microsoft Word, Excel and PowerPoint files.
  • Indexing of field information, such as the TITLE and H1 HTML tags.
  • Indexing of position information at the word or block level (e.g. a window of terms within a distance).
  • Support for various encodings of documents (UTF), to facilitate multi-lingual retrieval.
  • Highly compressed index disk data structures.
  • Highly compressed direct file for efficient query expansion.
  • Alternative faster single-pass indexing.
  • Various stemming techniques supported, including the Snowball stemmer for European languages.
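
As a rough illustration of what word-level position indexing makes possible, the sketch below builds a positional inverted index and uses it for phrase matching. It is a toy in-memory structure, not Terrier's compressed on-disk format, and the names `build_positional_index` and `phrase_match` are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> doc id -> list of positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index[tok][doc_id].append(pos)
    return index


def phrase_match(index, phrase):
    """Return the ids of documents where `phrase` occurs as consecutive terms."""
    if not phrase:
        return set()
    first, rest = phrase[0], phrase[1:]
    hits = set()
    for doc_id, positions in index.get(first, {}).items():
        for p in positions:
            # Every following term must appear at the next position in turn.
            if all(
                p + i + 1 in index.get(t, {}).get(doc_id, [])
                for i, t in enumerate(rest)
            ):
                hits.add(doc_id)
                break
    return hits


docs = {
    1: ["information", "retrieval", "engine"],
    2: ["retrieval", "of", "information"],
}
index = build_positional_index(docs)
matches = phrase_match(index, ["information", "retrieval"])
```

Both documents contain both terms, but only document 1 has them adjacent and in order, which is the distinction a position-free index cannot make.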

Retrieval

  • Provides standard querying facilities, as well as Query Expansion (pseudo-relevance feedback).
  • Can be applied in interactive applications, such as the included Desktop Search, or in a batch setting for research & experimentation.
  • Provides many standard document weighting models, including up to 126 Divergence From Randomness (DFR) document ranking models, and other models such as Okapi BM25, language modelling and TF-IDF. The new DFRee DFR weighting model is also included, which provides robust performance on a range of test collections without the need for any parameter tuning or training.
  • Advanced query language that supports boolean operators, +/- operators, phrase and proximity search, and fields.
  • Provides a number of parameter-free DFR term weighting models for automatic query expansion, in addition to Rocchio's query expansion.
  • Flexible processing of terms through a pipeline of components, such as stop-words removers and stemmers.
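
The idea of processing terms through a pipeline of components can be sketched as a chain of small functions, each of which either transforms a term or drops it. This is only an illustration of the pattern, not Terrier's API: the stage names, the tiny stop-word list and the `crude_stem` rule are all assumptions, and a real system would use a proper stemmer such as Snowball.

```python
STOPWORDS = {"the", "of", "and", "a"}  # tiny illustrative list


def lowercase(term):
    return term.lower()


def remove_stopwords(term):
    # Returning None signals that the term should be dropped.
    return None if term in STOPWORDS else term


def crude_stem(term):
    # Toy rule: strip a trailing "s" from longer words (not a real stemmer).
    if len(term) > 3 and term.endswith("s"):
        return term[:-1]
    return term


def run_pipeline(tokens, stages):
    """Pass each token through every stage in order, dropping it on None."""
    out = []
    for tok in tokens:
        for stage in stages:
            tok = stage(tok)
            if tok is None:
                break
        else:
            # Reached only when no stage dropped the token.
            out.append(tok)
    return out


terms = run_pipeline(
    ["The", "Engines", "of", "Retrieval"],
    [lowercase, remove_stopwords, crude_stem],
)
```

Because each stage has the same shape, stages can be reordered, removed or added without touching the others.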

Comments:

Carlos said...

SERI2009, Seminario Español de Recuperación de Información (Spanish Information Retrieval Seminar)
Friday, 3 April 2009
http://documentalista-audaz.blogspot.com/