TREC Dynamic Domain Track 2017

TREC Dynamic Domain Track 2017 Guidelines

The goal of the dynamic domain (DD) track is to support research in dynamic, exploratory search of complex information domains. DD systems receive relevance feedback as they explore a space of subtopics within the collection in order to satisfy a user's information need.

1. Participation in TREC

To take part in the DD track, you need to be a registered TREC participant. The TREC Call for Participation includes instructions on how to register. The datasets and relevance judgments will be made generally available to non-participants after the TREC 2017 cycle, in February 2018, so register to participate if you want early access.

2. Domains and datasets for 2017

The TREC DD track provides interesting and understudied domains of documents. In 2017, we provide two datasets from different domains: Ebola and New York Times.

The Ebola dataset contains webpages, tweets, and other records related to the Ebola outbreak in Africa from 2014 to 2015. Documents in this dataset were crawled from the internet, and most of them are in HTML format.

The New York Times dataset contains articles published in The New York Times from 1987 to 2007. Documents in this dataset are manually organized and annotated. All the documents in this dataset are stored in NITF (News Industry Text Format).

Please check the Datasets page for more details.

3. Topics

Within each domain, there will be 25-50 topics that represent user search needs. The topics were developed by the NIST assessors over a period of three weeks. A topic (which is like a query) contains a few words. It is the main search target for one complete run of dynamic search. Each topic contains multiple subtopics, each of which addresses one aspect of the topic. The NIST assessors have tried very hard to produce a complete set of subtopics for each topic, so we will treat them as the complete set and use them in the interactions and evaluation.

An example topic from the 2015 Illicit Goods domain topic set may be found here. It is about "paying for amazon book reviews" and contains 2 subtopics.

The topics will be made available from the Tracks page in the TREC Active Participants area. You cannot access this page without registering for TREC. If you lose your active participants password, you will need to contact

4. Task Description

The topics file contains not only the queries but also the full ground truth data: subtopics, relevant documents, and highlighted passages. DO NOT READ THE FILE. The file should be used only as input to the jig, and your system receives the truth data only via the jig. If you examine the topics file, your run may be labeled a manual run rather than an automatic one.

When your "run" starts, your system communicates with the jig via a simple API. Your system indicates that it is ready for a new query, and the jig gives it a query along with its domain label. The initial query for each topic is two to four words long and indicates the domain by a number: 1, 2, 3, or 4. Your system can use that query to search the domain collection and return up to five documents to the jig.

The jig, acting as a simulated user, replies with relevance information for any of the retrieved documents that have been judged: which documents are relevant to the user's interests in the topic, and to which subtopics each document is relevant. The simulated user also identifies passages in the relevant documents and assigns them to subtopics with a graded relevance rating. The system may then return up to five more documents for further feedback. Systems should stop once they believe they have covered all of the user's subtopics sufficiently.

The subtopics are not known to your system in advance; systems must discover the subtopics from the user's responses.
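
The interaction described above can be sketched as a loop. This is only an illustration, not the jig's actual API: toy_jig_feedback, TOY_TRUTH, the document and subtopic identifiers, and the "patience" stopping rule are all hypothetical stand-ins. The real jig and its ground truth must be used for actual runs.

```python
from typing import Dict, List

# Hypothetical stand-in for the jig's ground truth: docno -> {subtopic_id: rating}.
# These identifiers and ratings are made up for illustration.
TOY_TRUTH: Dict[str, Dict[str, int]] = {
    "doc-a": {"T1.1": 3},
    "doc-c": {"T1.2": 2},
    "doc-f": {"T1.1": 1, "T1.3": 4},
}


def toy_jig_feedback(docnos: List[str]) -> Dict[str, Dict[str, int]]:
    """Return feedback only for judged documents; others are simply absent."""
    if len(docnos) > 5:
        raise ValueError("at most five documents per iteration")
    return {d: TOY_TRUTH[d] for d in docnos if d in TOY_TRUTH}


def run_topic(ranked_docnos: List[str], max_iterations: int = 10,
              patience: int = 1) -> Dict[str, int]:
    """Submit batches of five; stop after `patience` batches with no new subtopic."""
    discovered: Dict[str, int] = {}  # subtopic_id -> best rating seen so far
    stale = 0
    for i in range(max_iterations):
        batch = ranked_docnos[i * 5:(i + 1) * 5]
        if not batch:
            break
        feedback = toy_jig_feedback(batch)
        before = len(discovered)
        for rels in feedback.values():
            for subtopic, rating in rels.items():
                discovered[subtopic] = max(rating, discovered.get(subtopic, 0))
        # Assumed stopping rule: give up when feedback reveals nothing new.
        stale = stale + 1 if len(discovered) == before else 0
        if stale >= patience:
            break
    return discovered
```

The subtopic identifiers only become known to the system as keys of the feedback, which mirrors the requirement that subtopics be discovered from the user's responses. A real system would also re-rank its remaining candidates after each batch instead of walking a fixed list.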

The jig only gives relevance information when it exists. If the jig gives no information about a document your system retrieved, this does not mean that the document is not relevant; it means that the user has not examined it. Your system should assume this partial relevance situation, NOT the traditional TREC interpretation that unjudged documents are not relevant.
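
In code, this amounts to keeping "no feedback" as its own state instead of folding it into "not relevant". The following minimal sketch assumes the feedback arrives as a dictionary keyed by document number; that shape, and the name split_by_feedback, are assumptions for illustration and not the jig's actual output format.

```python
from typing import Dict, List, Tuple

# Assumed feedback shape: docno -> {subtopic_id: rating}.
Feedback = Dict[str, Dict[str, int]]


def split_by_feedback(submitted: List[str],
                      feedback: Feedback) -> Tuple[List[str], List[str]]:
    """Separate documents that received feedback from unexamined ones.

    Unexamined documents are returned on their own so that a learner does
    not use them as negative examples.
    """
    with_feedback = [d for d in submitted if d in feedback]
    unexamined = [d for d in submitted if d not in feedback]
    return with_feedback, unexamined
```

For example, a relevance feedback model would be updated with the first list only, while the second list is left out of the update altogether.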

The following picture illustrates the task:


5. User Simulation ("Jig") and Feedback Format

The system's interactions with the user can be simulated by a jig that the track coordinators provide. This jig runs on Linux, Mac OS, and Windows. You will need the topics with ground truth to make the jig work, and your system may only interact with the ground truth through the jig that we provide.

You can find the jig program here.

6. Task Measures

Runs are evaluated with the Cube Test, sDCG, and Expected Utility. Scoring scripts are included in the jig. Other diagnostic measures such as precision and recall may also be reported.

The Cube Test is a search effectiveness measure that evaluates the speed of gaining relevant information (documents or passages) in a dynamic search process. It takes into account both the amount of relevant information a system gathers and the time needed over the entire search process. The higher the Cube Test score, the better the IR system.

sDCG extends the classic DCG to a search session that consists of multiple iterations. The relevance scores of results that are ranked lower or returned in later iterations are discounted more heavily. The discounted cumulative relevance score is the final result of this metric.
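
The double discount can be illustrated with a simplified sketch. It is not the jig's scoring script, which remains authoritative: the exact discount form (1 + log of the position) and the default log bases b and bq are assumptions chosen for illustration.

```python
import math
from typing import List


def sdcg(session_gains: List[List[float]], b: float = 2.0, bq: float = 4.0) -> float:
    """Simplified session DCG.

    session_gains[q][r] is the gain of the result at rank r+1 in iteration q+1.
    Each gain is divided by a rank discount and by an iteration discount, so
    lower ranks and later iterations contribute less.
    """
    total = 0.0
    for q, gains in enumerate(session_gains, start=1):
        iteration_discount = 1.0 + math.log(q, bq)
        for r, gain in enumerate(gains, start=1):
            rank_discount = 1.0 + math.log(r, b)
            total += gain / (rank_discount * iteration_discount)
    return total
```

With these assumed bases, a document with gain 3 counts fully at rank 1 of the first iteration, is halved at rank 2 of the first iteration, and is reduced to two thirds at rank 1 of the second iteration.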

Expected Utility scores different runs by measuring the relevant information a system found and the length of documents. The relevance scores of documents are discounted based on ranking order and novelty. The document length is discounted only based on ranking position. The difference between the cumulative relevance score and the aggregated document length is the final score of each run.

Jiyun Luo, Christopher Wing, Hui Yang, and Marti Hearst. 2013. The water filling model and the cube test: multi-dimensional evaluation for professional search. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. ACM, 709-714.
Kalervo Järvelin, Susan L Price, Lois ML Delcambre, and Marianne Lykke Nielsen. 2008. Discounted cumulated gain based evaluation of multiple-query IR sessions. In European Conference on Information Retrieval. Springer, 4-15.
Yiming Yang and Abhimanyu Lad. 2009. Modeling expected utility of multi-session information distillation. In Conference on the Theory of Information Retrieval. Springer, 164-175.

7. Run Format

In TREC, a "run" is the output of a search system over all topics. In the DD track, the runs are the output of the harness jig. Participating groups typically submit more than one run corresponding to different parameter settings or algorithmic choices. The maximum number of runs allowed for DD-2017 is five from each team.

We use a line-oriented format similar to the classic TREC submission format:

topic_id               docno       ranking_score         on_topic          subtopic_rels

where on_topic is 1 if the document is relevant to any subtopic and 0 otherwise, and subtopic_rels lists the graded relevance of the document for each relevant subtopic as subtopic_id:rating pairs separated by '|'. For instance:
topic_id docno ranking_score on_topic subtopic_rels
DD15-1 2322120460-d6783cba6ad386f4444dcc2679637e0b 883.000000 1 DD15-1.1:3|DD15-1.4:2|DD15-1.4:2|DD15-1.4:2|DD15-1.4:2|DD15-1.4:2|DD15-1.2:2|DD15-1.2:2
DD15-1 1322509200-f67659162ce908cc510881a6b6eabc8b 564.000000 1 DD15-1.1:3
DD15-1 1321860780-f9c69177db43b0f810ce03c822576c5c 177.000000 1 DD15-1.1:3
DD15-1 1320503040-e8c92486dc3462e4a352c4fd41d3a723 66.000000 0
DD15-1 1327908780-d9ad76f0947e2acd79cba3acd5f449f7 25.000000 1 DD15-1.3:2|DD15-1.1:2
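
A line in this format can be read back with a few lines of code. This is a minimal sketch and not part of the jig: RunLine and parse_run_line are hypothetical names, and keeping the highest rating when a subtopic repeats on one line (as in the first example line) is an assumption about how repeats should be merged.

```python
from typing import Dict, NamedTuple


class RunLine(NamedTuple):
    topic_id: str
    docno: str
    ranking_score: float
    on_topic: int
    subtopic_rels: Dict[str, int]  # subtopic_id -> rating


def parse_run_line(line: str) -> RunLine:
    """Parse one whitespace-separated run line; subtopic_rels may be absent."""
    fields = line.split()
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 fields, got {len(fields)}")
    topic_id, docno, score, on_topic = fields[:4]
    rels: Dict[str, int] = {}
    if len(fields) == 5:
        for item in fields[4].split("|"):
            subtopic, rating = item.rsplit(":", 1)
            # Assumption: when a subtopic repeats, keep its highest rating.
            rels[subtopic] = max(int(rating), rels.get(subtopic, 0))
    return RunLine(topic_id, docno, float(score), int(on_topic), rels)
```

Applied to the last example line above, this yields on_topic 1 and ratings of 2 for subtopics DD15-1.3 and DD15-1.1. The line with on_topic 0 yields an empty mapping.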

8. Requirements

Participants are expected to submit at least one run by the deadline.

Runs may be fully automatic, or manual. Manual indicates intervention by a person at any stage of the retrieval. We welcome unusual approaches to the task including human-in-the-loop searching, as this helps us set upper performance bounds.