Apuntes, son solo apuntes: New version of Terrier (Terabyte Retriever)

Friday, January 04, 2008

Today (4 January 2008) the Information Retrieval Group at the University of Glasgow (where the great Keith van Rijsbergen does his research) announced the release of version 2 of Terrier. This software, written in Java, is a probabilistic information retrieval engine that implements a model known as DFR (Divergence From Randomness).

Its features are:

General

Indexing support for common desktop file formats, and for commonly used TREC research collections (e.g. TREC CDs 1-5, WT2G, WT10G, GOV, GOV2, Blogs06).
Many document weighting models, such as many parameter-free Divergence from Randomness weighting models, Okapi BM25 and language modelling.
Conventional query language supported, including phrases, and terms occurring in tags.
Handling full-text indexing of large-scale document collections, in a centralised architecture to at least 25 million documents.
Modular and open indexing and querying APIs, to allow easy extension for your own applications and research.
Active Information Retrieval research fed into the Open Source platform.
Open Source (Mozilla Public Licence).
Written in cross-platform Java - works on Windows, Mac OS X, Linux and Unix.
Large user-base over 3 years of public release.
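
To give an idea of what one of the document weighting models listed above computes, here is a minimal Python sketch of Okapi BM25. It is not Terrier's code (Terrier is written in Java and has its own API): the function name `bm25_scores`, the defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with Okapi BM25.

    Illustrative sketch only; parameter defaults are common textbook
    values, not values taken from Terrier.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "terrier is a retrieval engine".split(),
    "glasgow released terrier version two".split(),
    "java runs on many platforms".split(),
]
scores = bm25_scores(["terrier", "retrieval"], docs)
```

The first document matches both query terms and so scores highest, the second matches only one, and the third matches none and gets zero. Unlike the parameter-free DFR models mentioned in the list, BM25 has the two parameters `k1` and `b` that usually need tuning.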

Indexing

Out-of-the-box indexing of tagged document collections, such as the TREC test collections.
Out-of-the-box indexing for documents of various formats, such as HTML, PDF, or Microsoft Word, Excel and PowerPoint files.
Indexing of field information, such as the TITLE and H1 HTML tags.
Indexing of position information on a word, or a block (e.g. a window of terms within a distance) level.
Support for various encodings of documents (UTF), to facilitate multi-lingual retrieval.
Highly compressed index disk data structures.
Highly compressed direct file for efficient query expansion.
Alternative faster single-pass indexing.
Various stemming techniques supported, including the Snowball stemmer for European languages.
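
Indexing position information is what makes phrase and proximity queries possible. The following Python sketch shows the idea with an uncompressed in-memory structure; the names `build_positional_index` and `phrase_search` are hypothetical and this is not how Terrier lays out its compressed index files on disk.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for tokenised documents."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(docs):
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents where the phrase terms are consecutive."""
    if not phrase or any(term not in index for term in phrase):
        return []
    # Only documents containing every term can contain the phrase.
    candidates = set(index[phrase[0]])
    for term in phrase[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        # Try each occurrence of the first term as the start of the phrase.
        for start in index[phrase[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(phrase)):
                hits.append(doc_id)
                break
    return hits


docs = [
    "query expansion with pseudo relevance feedback".split(),
    "relevance of pseudo feedback".split(),
    "pseudo relevance feedback".split(),
]
index = build_positional_index(docs)
matches = phrase_search(index, ["pseudo", "relevance"])
```

The second document contains both words but not next to each other, so only the first and third documents match. A block-level index, as mentioned in the list, would store a coarser block number instead of the exact word position.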

Retrieval

Provides standard querying facilities, as well as Query Expansion (pseudo-relevance feedback).
Can be applied in interactive applications, such as the included Desktop Search, or in a batch setting for research & experimentation.
Provides many standard document weighting models, including up to 126 Divergence From Randomness (DFR) document ranking models, and other models such as Okapi BM25, language modelling and TF-IDF. The new DFRee DFR weighting model is also included, which provides robust performance on a range of test collections without the need for any parameter tuning or training.
Advanced query language that supports boolean operators, +/- operators, phrase and proximity search, and fields.
Provides a number of parameter-free DFR term weighting models for automatic query expansion, in addition to Rocchio's query expansion.
Flexible processing of terms through a pipeline of components, such as stop-words removers and stemmers.
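
The last point, a pipeline of term-processing components, can be sketched in a few lines of Python. Everything here is an assumption for illustration: the stage names, the tiny stop-word list, and the toy suffix stripper, which only stands in for a real stemmer such as Snowball. It does not reproduce Terrier's own Java pipeline classes.

```python
# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = frozenset({"the", "of", "a", "and", "in", "is"})


def lowercase(terms):
    """Normalise case so later stages can compare terms directly."""
    for term in terms:
        yield term.lower()


def remove_stop_words(terms):
    """Drop terms that carry little content."""
    for term in terms:
        if term not in STOP_WORDS:
            yield term


def strip_plural_s(terms):
    """Toy suffix stripper standing in for a real stemmer."""
    for term in terms:
        if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
            yield term[:-1]
        else:
            yield term


def run_pipeline(terms, stages):
    """Feed the terms through each stage in order and collect the result."""
    for stage in stages:
        terms = stage(terms)
    return list(terms)


PIPELINE = [lowercase, remove_stop_words, strip_plural_s]
result = run_pipeline("The Indexing of Documents and Queries".split(), PIPELINE)
```

Because each stage only consumes and produces a stream of terms, stages can be reordered, removed or replaced without touching the others. The toy stripper turns "queries" into "querie", which is a good reminder of why a proper stemmer is used in practice.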

2 comments:

Carlos said...: SERI2009 Seminario Español de Recuperación de Información
Friday 3/4/09
http://documentalista-audaz.blogspot.com/; 8:37 p.m.
Carlos said...: SERI2009 Seminario Español de Recuperación de Información
Friday 3/4/09
http://documentalista-audaz.blogspot.com/; 8:39 p.m.
