The 24th International World Wide Web Conference (WWW 2015), Florence, Italy

Workshop Day

May 18th, 2015, afternoon (a half-day workshop)

Workshop Objectives

While empirical evaluation has long been a focus in building Web-based services such as recommender and advertising systems, traditional evaluation methodologies face new challenges in truly reflecting these systems’ actual online performance. This workshop aims to connect academic researchers and industrial practitioners who are working on, or interested in, online and offline evaluation of Web-based services. The goal is to provide a forum so that

* Industrial practitioners can present real-world challenges and share practical experiences.

* Academic researchers can popularize state-of-the-art research.

* Collaboration between the two can be fostered.

Topics and Themes

The workshop will be a half-day event consisting of invited talks and contributed talks, posters, and demos, and it welcomes all topics related to evaluation for Web applications. A balance between academic and industrial contributions will be sought.

Topics of the workshop include, but are not limited to, the following:

* Classic offline evaluation methodologies, especially those based on standard metrics such as RMSE and NDCG (a minimal sketch of both metrics follows this list).

* Online controlled experiments, such as A/B testing.

* Online interleaving experiments.

* Online adaptive sequential experiments.

* Offline evaluation of direct metrics such as revenue and click-through rate gains.

* Causal inference and counterfactual analysis based on log data (see the counterfactual-estimation sketch after this list).

* Practical applications and lessons related to evaluation on the Web.
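To ground the first topic above, the following is a minimal sketch of how RMSE and NDCG are typically computed offline. The function names and toy data are ours, for illustration only, and we use the linear-gain form of DCG, one of several common variants.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    assert len(predicted) == len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades
    (linear gain, log2 position discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """DCG normalized by the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Toy example: ratings for RMSE, graded relevance labels for NDCG.
print(rmse([3.8, 2.1, 4.5], [4.0, 2.0, 5.0]))  # ~0.32
print(ndcg([3, 2, 0, 1]))                      # < 1.0: the ranking is not ideal
```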
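For the causal-inference topic, the sketch below illustrates the basic inverse-propensity-scoring (IPS) idea for estimating, from logged data, the click-through rate a new policy would have achieved. The log format, policy, and numbers are hypothetical, and real applications must additionally handle propensity estimation and variance control (e.g., weight clipping), which this sketch ignores.

```python
def ips_ctr_estimate(log, new_policy):
    """IPS estimate of the CTR a new policy would have achieved on logged
    (context, action, click, propensity) tuples.

    `propensity` is the probability the *logging* policy chose `action`;
    dividing by it corrects for the mismatch between the two policies.
    """
    total = 0.0
    for context, action, click, propensity in log:
        # A deterministic new policy gets weight 1/p on logged actions it
        # agrees with, and weight 0 on everything else.
        if new_policy(context) == action:
            total += click / propensity
    return total / len(log)

# Hypothetical log: the logging policy picked uniformly between 2 ads (p = 0.5).
log = [
    ("sports", "ad_a", 1, 0.5),
    ("sports", "ad_b", 0, 0.5),
    ("news",   "ad_a", 0, 0.5),
    ("news",   "ad_b", 1, 0.5),
]
# Candidate policy: always show ad_a on sports pages, ad_b otherwise.
policy = lambda context: "ad_a" if context == "sports" else "ad_b"
print(ips_ctr_estimate(log, policy))  # 1.0: both matched impressions were clicked
```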

Examples of open questions we would like to discuss include (but are not limited to):

* What are the best ways to measure the performance of a machine learning algorithm offline? When are traditional machine learning criteria such as precision, recall, and AUC good enough to reflect the actual quality of the system?

* What are more efficient ways of running online experiments than vanilla randomized controlled experiments? Can adaptive sequential experimentation techniques (such as Yelp’s MOE) be helpful? Can variants such as interleaving be useful for applications other than information retrieval? (A team-draft interleaving sketch follows this list.)

* What effective offline evaluation techniques do we have for measuring conversion/revenue gains, and what are their limitations?

* How can we measure confidence in these estimates, in either the online or the offline case? Should we use t-tests or Bayesian statistics? (A small comparison of the two appears after this list.)
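As a pointer for the interleaving question, here is a minimal team-draft interleaving sketch: the two rankers alternately (in a random order each round) contribute their best not-yet-shown document, and clicks are credited to the ranker that supplied the clicked slot. The rankings, click positions, and function names are hypothetical.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Team-draft interleaving of two ranked lists. Returns the merged list
    and which team ('A' or 'B') supplied each slot."""
    interleaved, teams, used = [], [], set()
    ia = ib = 0
    while ia < len(ranking_a) or ib < len(ranking_b):
        for team in rng.sample(("A", "B"), 2):  # random pick order each round
            ranking, i = (ranking_a, ia) if team == "A" else (ranking_b, ib)
            while i < len(ranking) and ranking[i] in used:
                i += 1  # skip documents the other team already contributed
            if team == "A":
                ia = i
            else:
                ib = i
            if i < len(ranking):
                interleaved.append(ranking[i])
                teams.append(team)
                used.add(ranking[i])
    return interleaved, teams

def credit(teams, clicked_slots):
    """Count clicks per team; the team with more clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for slot in clicked_slots:
        wins[teams[slot]] += 1
    return wins

docs, teams = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"])
print(docs, teams, credit(teams, clicked_slots=[0]))
```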
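For the last question, the sketch below runs both a frequentist Welch t-test and a simple Bayesian alternative (Beta-Binomial posteriors compared by Monte Carlo) on the same hypothetical A/B data; the traffic sizes and click rates are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical A/B test: per-user click indicators for control and treatment.
rng = np.random.default_rng(0)
control = rng.binomial(1, 0.10, size=5000)    # simulated true CTR 10%
treatment = rng.binomial(1, 0.11, size=5000)  # simulated true CTR 11%

# Frequentist: Welch's two-sample t-test on the click indicators.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"Welch t-test: t={t_stat:.2f}, p={p_value:.3f}")

# Bayesian: Beta(1, 1) priors on each arm's CTR; estimate the posterior
# probability P(CTR_treatment > CTR_control) by sampling both posteriors.
post_c = rng.beta(1 + control.sum(), 1 + len(control) - control.sum(), 100_000)
post_t = rng.beta(1 + treatment.sum(), 1 + len(treatment) - treatment.sum(), 100_000)
print(f"P(treatment beats control) ~ {(post_t > post_c).mean():.3f}")
```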