1 Introduction

Search behavior, and information-seeking behavior more generally, is often motivated by tasks that prompt search processes which are lengthy, iterative, and intermittent, and which are characterized by distinct stages, shifting goals, and multitasking. Current search engines do not adequately support complex tasks (e.g., planning a trip or surveying a topic), so the cognitive burden of keeping track of such tasks and completing them falls on the searcher. Ideally, a search engine should be able to decipher the underlying reason that led the user to submit a query (i.e., the actual task that caused the query to be issued) and use this knowledge of the actual information need to guide the user toward completing the task.

In this research, we hypothesize that a comprehensive understanding of users' tasks would help provide better support and recommendations based on users' contextual information and, as a result, help users accomplish their tasks. As part of the proposed research, we consider the challenge of extracting tasks from a given collection of search log data and present task extraction techniques that rely on recent advances in Bayesian nonparametrics and word embeddings. We evaluate the performance of these techniques using a number of evaluation methods based on crowdsourced judgments as well as labelled ground truth data.

2 Task Based Information Retrieval

Our efforts at developing task based retrieval systems have focussed on three major themes: (i) understanding searchers' behavior, (ii) developing task extraction techniques, and (iii) showing the benefits of task information via improved personalization. We next describe each of them in detail.

2.1 Understanding Searcher’s Task Behavior

While a major share of prior work has considered search sessions as the focal unit of analysis for seeking behavioral insights [7–9], search tasks are emerging as a competing perspective in this space. In recent work [1], we quantify the multi-tasking behavior of web search users and show that over 50% of search sessions have more than 2 tasks. Further, we provide a method to categorize users as focused, multi-taskers, or supertaskers depending on their level of task-multiplicity, and show that the search effort expended varies across these groups. Additionally, in follow-up work [3], we relate users' multitasking propensities to tasks and topics. Specifically, we analyze the user-disposition, topic, and user-interest level heterogeneities that are prevalent in search task behavior. We find that not only do users have varying propensities to multi-task, they also search for distinct topics across single-task and multi-task sessions. The findings from our analysis provide useful insights about task-multiplicity in an online search environment and hold potential value for search engines that wish to personalize and support users' search experiences based on their task behavior.

2.2 Extracting Hierarchies

An important first step in developing task based systems is task extraction. In a recently published work [4], we considered the challenge of extracting hierarchies of search tasks and their associated subtasks from a search log, using only the log data and no manual annotation. We present an efficient Bayesian nonparametric model for discovering task hierarchies and propose a tree based Bayesian hierarchical task construction algorithm to discover the rich hierarchical structure embedded within search logs. Our model organises the queries into a nested hierarchy T of tasks/subtasks, with all queries in one node at the root and singleton queries at the leaves. We interpret a tree T as a mixture of partitions over the group of queries Q. We define the probability of such a group of queries as:

$$\begin{aligned} p(Q|T) = \sum _{\phi (T)} p(\phi (T)) \, p(Q|\phi (T)) \end{aligned}$$
(1)

where \(p(\phi (T))\) is the mixing proportion of partition \(\phi (T)\), and \(p(Q|\phi (T))\) is the probability of the group of queries Q given the partitioning \(\phi (T)\). In general, the number of partitions consistent with T can be exponentially large. To make computations tractable, we define the mixture model in such a way that \(p(Q|T)\) can be computed using dynamic programming over T:

$$\begin{aligned} p(Q | T) = \pi _T f(Q) + (1 - \pi _T) \prod _{T_i \in ch(T)} p(leaves(T_i)|T_i) \end{aligned}$$
(2)
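The recursion in Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from [4]: the nested-list tree encoding, the constant `pi` standing in for \(\pi _T\), and the toy function `toy_f` standing in for \(f(Q)\) are all hypothetical choices made for the example. We also assume that a leaf, which has no children, contributes only the \(f(Q)\) term.

```python
from math import prod


def leaves(tree):
    """Return the queries at the leaves of a tree; a bare string is a leaf."""
    if isinstance(tree, str):
        return [tree]
    return [q for child in tree for q in leaves(child)]


def tree_likelihood(tree, f, pi=0.5):
    """Evaluate p(Q | T) from Eq. (2) by recursing over the children of T.

    f  : callable mapping a list of queries to a likelihood value (assumed).
    pi : constant stand-in for pi_T, the weight of the f(Q) term (assumed).
    """
    queries = leaves(tree)
    if isinstance(tree, str):
        # A leaf has no children, so only the f(Q) term applies (assumption).
        return f(queries)
    # Product over children of p(leaves(T_i) | T_i), each computed recursively.
    children_term = prod(tree_likelihood(child, f, pi) for child in tree)
    return pi * f(queries) + (1.0 - pi) * children_term


def toy_f(queries):
    """Hypothetical stand-in for f(Q): decays with the number of queries."""
    return 1.0 / (len(queries) + 1)


# A task with one two-query sub-task and one singleton query.
example_tree = [["cheap flights paris", "paris flights"], "eiffel tower tickets"]
print(tree_likelihood(example_tree, toy_f, pi=0.5))
```

Each node is visited once, so the exponentially many partitions consistent with T never have to be enumerated explicitly.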

In the beginning, each query is regarded as a tree on its own. At each step, the algorithm selects two trees \(T_i\) and \(T_j\) and merges them into a new tree \(T_m\). Unlike binary hierarchical clustering, we allow three possible merging operations: (i) Join: \(T_m = \lbrace T_i, T_j\rbrace \), such that the tree \(T_m\) has two children; (ii) Absorb: \(T_m = \lbrace children(T_i) \cup T_j\rbrace \), i.e., \(T_j\) is absorbed as an additional child of \(T_i\), forming a tree with more than two children; and (iii) Collapse: \(T_m = \lbrace children(T_i) \cup children(T_j)\rbrace \), i.e., the children of both subtrees are combined at the same level. Such a setting allows each task to be composed of an arbitrary number of sub-tasks, rather than restricting tasks to binary sub-task splits.

The tree is built in a bottom-up greedy agglomerative fashion, and the algorithm finishes when just one tree remains. At each iteration a pair of trees in the forest F is chosen to be merged by considering the pair and type of merger that yields the largest Bayes factor improvement over the current model. Further details of the work are available in our research paper [4].
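The greedy construction can be sketched as follows. This is an illustrative sketch only, not the algorithm as implemented in [4]: we assume the merge score is the ratio \(p(Q_m|T_m) / (p(Q_i|T_i)\, p(Q_j|T_j))\), use a constant `pi` in place of \(\pi _T\), and plug in a hypothetical `shared_vocab_f` that simply rewards groups of queries with overlapping vocabulary. Trees are encoded as nested lists, with a bare string as a leaf.

```python
from itertools import combinations
from math import prod


def is_leaf(tree):
    return isinstance(tree, str)


def leaves(tree):
    return [tree] if is_leaf(tree) else [q for c in tree for q in leaves(c)]


def likelihood(tree, f, pi):
    """p(Q | T) as in Eq. (2), with a constant pi standing in for pi_T."""
    if is_leaf(tree):
        return f([tree])
    return pi * f(leaves(tree)) + (1 - pi) * prod(likelihood(c, f, pi) for c in tree)


def candidate_merges(ti, tj):
    """Yield the trees produced by join, absorb (either direction) and collapse."""
    yield [ti, tj]                      # join: T_m has exactly two children
    if not is_leaf(ti):
        yield list(ti) + [tj]           # absorb: T_j becomes a child of T_i
    if not is_leaf(tj):
        yield list(tj) + [ti]           # absorb: T_i becomes a child of T_j
    if not is_leaf(ti) and not is_leaf(tj):
        yield list(ti) + list(tj)       # collapse: children of both at one level


def build_hierarchy(queries, f, pi=0.5):
    """Greedy bottom-up merging until a single tree remains."""
    forest = list(queries)              # every query starts as its own tree
    while len(forest) > 1:
        best = None
        for i, j in combinations(range(len(forest)), 2):
            ti, tj = forest[i], forest[j]
            separate = likelihood(ti, f, pi) * likelihood(tj, f, pi)
            for tm in candidate_merges(ti, tj):
                score = likelihood(tm, f, pi) / separate  # assumed merge score
                if best is None or score > best[0]:
                    best = (score, i, j, tm)
        _, i, j, tm = best
        forest = [t for k, t in enumerate(forest) if k not in (i, j)] + [tm]
    return forest[0]


def shared_vocab_f(queries):
    """Hypothetical f(Q): favours groups of queries that reuse the same words."""
    vocabulary = {word for q in queries for word in q.split()}
    return 0.1 ** len(vocabulary)


queries = ["paris flights", "cheap paris flights",
           "python tutorial", "python list tutorial"]
print(build_hierarchy(queries, shared_vocab_f))
```

The exhaustive pairwise search is quadratic in the number of trees at every iteration; it is written this way for readability rather than efficiency.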

2.3 Decomposing Complex Search Tasks

Quite often, search tasks (e.g., planning a trip) are complex and conceptually decompose into a set of sub-tasks (e.g., booking flights, finding places of interest), each of which requires the user to issue multiple queries. Given a collection of on-task queries (extracted using a standard task extraction algorithm), we proposed a distance dependent Chinese Restaurant Process (dd-CRP) model to extract these sub-tasks.

In our sub-task extraction problem, each task is associated with a dd-CRP and its tables are embellished with IID draws from a base distribution over mixture component parameters. Let \(z_i\) denote the ith query assignment, i.e., the index of the query to which the ith query is linked. Let \(d_{ij}\) denote the distance measurement between queries i and j, let D denote the set of all distance measurements between queries, and let f be a decay function. The distance dependent CRP independently draws the query assignments to sub-tasks conditioned on the distance measurements,

$$\begin{aligned} p(z_i = j | D,\alpha ) \propto {\left\{ \begin{array}{ll} f(d_{ij}) &{} \text {if } j \ne i \\ \alpha &{} \text {if } j = i \end{array}\right. } \end{aligned}$$

Here, \(d_{ij}\) is an externally specified distance between queries i and j, and \(\alpha \) determines the probability that a query links to itself rather than to another query. Given a decay function f, distances between queries D, scaling parameter \(\alpha \), and an exchangeable Dirichlet distribution with parameter \(\lambda \), N queries of M words each are drawn as follows:

  1. For \(i \in [1, N]\), draw \(z_i \sim dist-CRP(\alpha , f, D)\).

  2. For \(i \in [1, N]\),

    (a) If \(z_i \notin R^{*}_{q_{1:N}}\), set the parameter for the ith query to \(\theta _i = \theta _{q_i}\). Otherwise draw the parameter from the base distribution, \(\theta _i \sim Dirichlet(\lambda )\).

    (b) Draw the ith query terms, \(w_i \sim Mult(M, \theta _i)\).
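The link distribution and the generative steps above can be simulated with the following sketch. It is a simplified illustration, not the model code from [2]. Several choices are assumptions made for the example: the exponential decay, the toy `vocab`, the `seed` argument, and the reading of step 2(a) as "queries that end up in the same sub-task share one parameter vector drawn from \(Dirichlet(\lambda )\)". Sub-tasks are recovered as connected components of the link graph, which is how tables arise in a dd-CRP.

```python
import math
import random


def link_probabilities(i, distances, alpha, decay):
    """Normalised p(z_i = j | D, alpha): decay(d_ij) for j != i, alpha for j == i."""
    n = len(distances)
    weights = [alpha if j == i else decay(distances[i][j]) for j in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def connected_components(links):
    """Label each query by its sub-task: the component of the link graph it is in."""
    parent = list(range(len(links)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in enumerate(links):
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(links))]


def generate_queries(distances, alpha, decay, lam, vocab, m, seed=0):
    """Draw links, derive sub-tasks, then draw m words per query."""
    rng = random.Random(seed)
    n = len(distances)
    # Step 1: each query independently draws the query it links to.
    links = [
        rng.choices(range(n), weights=link_probabilities(i, distances, alpha, decay))[0]
        for i in range(n)
    ]
    labels = connected_components(links)
    theta = {}
    queries = []
    for label in labels:
        if label not in theta:
            # Step 2(a): one symmetric Dirichlet(lam) draw per sub-task (assumed reading),
            # obtained by normalising independent Gamma(lam, 1) variates.
            gammas = [rng.gammavariate(lam, 1.0) for _ in vocab]
            total = sum(gammas)
            theta[label] = [g / total for g in gammas]
        # Step 2(b): draw the m terms of the query from Mult(m, theta).
        queries.append(rng.choices(vocab, weights=theta[label], k=m))
    return links, labels, queries


def exp_decay(d):
    return math.exp(-d)


distances = [[0, 1, 5], [1, 0, 5], [5, 5, 0]]
vocab = ["flight", "hotel", "paris", "museum"]
links, labels, queries = generate_queries(distances, 1.0, exp_decay, 1.0, vocab, m=3)
print(links, labels, queries)
```

Larger values of `alpha` make self-links more likely and therefore produce more, smaller sub-tasks; a faster-decaying `decay` has a similar effect for queries that are far apart.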

Further details of the work are available in our research paper [2].

2.4 Task Based Personalization

To demonstrate the usefulness of a task based system, in recent work [5, 6] we presented a novel approach that couples users' topical interest information with their search task information and their term usage behavior to learn a joint user representation. We demonstrated that coupling users' task information with their topical interests indeed helps build better user models. We show through extensive experimentation that our task based method outperforms existing query term based and topical interest based user representation methods. By evaluating the quality of our approach on a variety of personalisation tasks, including collaborative query recommendation, cluster based recommendation, and user cohort analysis, we demonstrate that the proposed methods result in better user profiles.

3 Conclusion

In this note, we offered insights about the shift in focus from sessions to tasks and presented a brief summary of our recent work aimed at extracting tasks from search logs. We believe that task-based personalization and recommendation have the potential to shape the future of user interaction systems for the upcoming era of the intelligent Web, and that much remains to be done on this emerging topic. Key problems to investigate in the future include using task based systems for improved recommendations and better predicting users' contextual needs for proactive recommendations.