We are happy to announce that the following workshops and tutorials will be held at CIKM 2015.
The 8th Ph.D. Workshop in Information and Knowledge Management
Full day Workshop
Mouna Kacimi - Free University of Bozen-Bolzano, Italy
Nicoleta Preda - University of Versailles, France
Maya Ramanath - Indian Institute of Technology, India
User Modeling in Heterogeneous Search Environments
Full day Workshop
Aleksandr Chuklin - University of Amsterdam, The Netherlands & Google Switzerland
Yiqun Liu - Tsinghua University, China
Ilya Markov - University of Amsterdam, The Netherlands
Maarten de Rijke - University of Amsterdam, The Netherlands
Topic Models: Post-Processing and Applications
Full day Workshop
Nikolaos Aletras - University College London, UK
Jey Han Lau - King's College London, UK
Timothy Baldwin - The University of Melbourne, Australia
Mark Stevenson - University of Sheffield, UK
ACM Eighteenth International Workshop On Data Warehousing and OLAP
Full day Workshop
Il-Yeol Song - Drexel University, USA
Carlos Garcia-Alvarado - Pivotal Software Inc., USA
Carlos Ordonez - University of Houston, USA
Workshop on Large-Scale and Distributed System for Information Retrieval
Full day Workshop
Ismail Sengor Altingovde - METU, Turkey
B. Barla Cambazoglu - Yahoo Labs, Spain
Nicola Tonellotto - ISTI-CNR, Italy
Understanding the City with Urban Informatics
Full day Workshop
Yashar Moshfeghi - University of Glasgow
Iadh Ounis - University of Glasgow
Craig Macdonald - University of Glasgow
Joemon M. Jose - University of Glasgow
Peter Triantafillou - University of Glasgow
Mark Livingston - University of Glasgow
Piyushimita Thakuriah - University of Glasgow
Exploiting Semantic Annotations in Information Retrieval
Full day Workshop
Krisztian Balog - University of Stavanger, Norway
Jeffrey Dalton - Google Research, USA
Antoine Doucet - University of La Rochelle, France
Yusra Ibrahim - Max Planck Institute for Informatics, Germany
ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics
Full day Workshop
Doheon Lee - KAIST, Korea
Min Song - Yonsei University, Korea
Karin Verspoor - University of Melbourne, Australia
Evaluation on Collaborative Information Retrieval and Seeking
Full day Workshop
Leif Azzopardi - School of Computing Science at the University of Glasgow, UK
Jeremy Pickens - Catalyst Repository Systems, USA
Tetsuya Sakai - Waseda University, Japan
Laure Soulier - IRIT-Paul Sabatier University, France
Lynda Tamine-Lechani - IRIT-Paul Sabatier University, France
First International Workshop on Novel Web Search Interfaces and Systems
Full day Workshop
Davood Rafiei - University of Alberta
Katsumi Tanaka - Kyoto University
This full day tutorial focuses on explaining and building formal models of Information Seeking and Retrieval. The tutorial is structured into four sessions. In the first session we will discuss the rationale of modelling and examine a number of early formal models of search (including early cost models and the Probability Ranking Principle). Then we will examine more contemporary formal models (including Information Foraging Theory, the Interactive Probability Ranking Principle, and Search Economic Theory). The focus will be on the insights and intuitions that we can glean from the math behind these models. The later sessions will be dedicated to building models that optimise the particular objectives driving how users make decisions (i.e., a how-to guide on model building), and to describing different techniques (analytical, graphical and computational) that can be used to generate hypotheses from such models. In the final session, participants will be challenged to develop a simple model of interaction applying the techniques learnt during the day, before concluding with an overview of challenges and future directions.
Apache Spark is an open-source cluster computing framework. It has emerged as the next generation big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data revolution. Spark maintains MapReduce’s linear scalability and fault tolerance, but extends it in a few important ways: it is much faster (100 times faster for certain applications); it is much easier to program in, thanks to its rich APIs in Python, Java, Scala (and R) and its core data abstraction, the distributed data frame; and it goes far beyond batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph processing.
This tutorial will provide an accessible introduction for those not already familiar with Spark and its potential to revolutionize academic and commercial data science practices. It is divided into two parts: the first part will introduce fundamental Spark concepts, including Spark Core, data frames, the Spark Shell, Spark Streaming, Spark SQL, MLlib, and more; the second part will focus on hands-on algorithmic design and development with Spark (developing algorithms from scratch such as decision tree learning, graph processing algorithms such as PageRank and shortest path, and gradient descent algorithms such as support vector machines and matrix factorization). Industrial applications and deployments of Spark will also be presented. Example code will be made available in Python (PySpark) notebooks.
Rademacher Averages and the Vapnik-Chervonenkis dimension are fundamental concepts from statistical learning theory. They allow the study of simultaneous deviation bounds of empirical averages from their expectations for classes of functions, by considering properties of the problem, of the dataset, and of the sampling process. In this tutorial, we survey the use of Rademacher Averages and the VC-dimension for developing sampling-based algorithms for graph analysis and pattern mining. We start from their theoretical foundations at the core of machine learning, then show a generic recipe for formulating data mining problems in a way that allows using these concepts in the analysis of efficient randomized algorithms for those problems. Finally, we show examples of the application of the recipe to graph problems (connectivity, shortest paths, betweenness centrality) and pattern mining. Our goal is to expose the usefulness of these techniques for the data mining researcher, and to encourage research in the area.
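As a rough illustration of the quantity involved, the empirical Rademacher average of a finite family of functions on a fixed sample can be estimated by Monte Carlo simulation: draw random signs, take the best signed average over the family, and average over many draws. The sketch below is not taken from the tutorial material; the function name `empirical_rademacher`, the `values` layout, and the `trials` and `seed` parameters are assumptions made for this example.

```python
import random
from typing import List, Sequence


def empirical_rademacher(values: Sequence[Sequence[float]],
                         trials: int = 2000,
                         seed: int = 0) -> float:
    """Monte Carlo estimate of the empirical Rademacher average.

    values[j][i] holds f_j(s_i): the value of the j-th function of a
    finite family on the i-th sample point (hypothetical layout).
    """
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(trials):
        # One vector of independent, uniformly random signs in {-1, +1}.
        sigma: List[int] = [rng.choice((-1, 1)) for _ in range(n)]
        # Supremum over the family of the sign-weighted sample average.
        total += max(
            sum(s * v for s, v in zip(sigma, row)) / n for row in values
        )
    return total / trials


# A family containing a function and its negation on a single point:
# whatever the sign, one of the two functions attains the value 1.
symmetric_family = empirical_rademacher([[1.0], [-1.0]])
```

Richer families can fit more sign patterns and therefore yield larger values, which is the sense in which this quantity measures the complexity of a function class on a given sample.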
Afternoon
B. Barla Cambazoglu - Yahoo Labs
Ricardo Baeza-Yates - Yahoo Labs
Commercial web search engines need to process thousands of queries every second and provide responses to user queries within a few hundred milliseconds. As a consequence of these tight performance constraints, search engines construct and maintain very large computing infrastructures for crawling the Web, indexing discovered pages, and processing user queries. The scalability and efficiency of these infrastructures require careful performance optimizations in every major component of the search engine. This tutorial aims to provide a fairly comprehensive overview of the scalability and efficiency challenges in large-scale web search engines. In particular, the tutorial provides an in-depth architectural overview of a web search engine, mainly focusing on the web crawling, indexing, and query processing components. The scalability and efficiency issues encountered in the above-mentioned components are presented at four different granularities: at the level of a single computer, a cluster of computers, a single data center, and a multi-center search engine. The tutorial also points out open research problems and provides recommendations to researchers who are new to the field.
The evolution of the Web from a technology platform to a social ecosystem has resulted in unprecedented data volumes being continuously generated, exchanged, and consumed. User-generated content on the Web is massive, highly dynamic, and characterized by a combination of factual data and opinion data. False information, rumors, and fake content can easily spread across multiple sources, making it hard to distinguish between what is true and what is not. Truth discovery, also called fact-checking, has recently gained a lot of interest in Data Science communities. Ascertaining the veracity of data and understanding the dynamics of misinformation on the Web are two inter-dependent challenges for researchers and practitioners in Databases, Information Retrieval, and Knowledge Management.
This tutorial explores the progress that has been made in discovering truth, checking facts, and modeling the propagation of falsified and distorted information in the context of Big Data. We will review in detail current models, algorithms, and techniques proposed by various research communities in Complex System Modeling, Data Management, and Knowledge Discovery for ascertaining the veracity of data in a dynamic world. Finally, this tutorial will identify a wide range of open problems and research directions for discovering truth from falsehood(s) in Web data and understanding the evolution and propagation of information source trustworthiness.
Afternoon
A/Prof Dr Xue Li - School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Australia
Social media and social networks are popular places for people to express their opinions about consumer products, to organize or initiate social events, or to spread news. Understanding them raises several questions: How can we detect and predict emerging sensitive events? How can we predict the propagation patterns of online micro-blogs? How can we understand people's opinions about a current issue, a new product, or an important event? This tutorial presents recent research progress on data analytics for social media and social networks. A few application systems and relevant algorithms will be presented for answering the above questions.
Topics of the tutorial
Morning
Hua Lu - Aalborg University, Denmark
Muhammad Aamir Cheema - Monash University, Australia
A large part of modern life is lived indoors, such as in homes, offices, shopping malls, universities, libraries and airports. However, almost all of the existing location-based services (LBS) have been designed only for outdoor space. This is mainly because the global positioning system (GPS) and other positioning technologies cannot accurately identify locations in indoor venues. Some recent initiatives have started to cross this technical barrier, promising huge future opportunities for research organisations, government agencies, technology giants, and enterprising start-ups to exploit the potential of indoor LBS. Consequently, indoor data management has gained significant research attention in the past few years, and research interest is expected to surge in the upcoming years. This will result in a broad range of indoor applications including emergency services, public services, in-store advertising, shopping, tracking, guided tours, and much more. In this tutorial, we first highlight the importance of indoor data management and the unique challenges that need to be addressed. Then, we provide an overview of the existing research in indoor data management. Finally, we discuss future research directions in this important and growing research area.
In this tutorial, we aim to provide a unified and comprehensive overview of the state-of-the-art approaches to distance-based multimedia indexing. We intend to cover a broad target audience, ranging from beginners to experts in the domain of distance-based similarity search in multimedia databases and adjacent research fields that utilize distance-based approaches. No prerequisite knowledge is needed.
We begin by outlining different approaches to object representations, including the feature extraction process and suitable feature representation models as well as clustering-based computations, in order to answer the question of how to model multimedia data objects in a compact and generic way. In the second part of this tutorial, we present state-of-the-art similarity and dissimilarity measures, including kernels and distance functions, in order to complete our understanding of a similarity model. The third part is devoted to approaches for efficient query processing. After introducing similarity queries, we show how to process such queries efficiently by means of multi-step filter-and-refinement algorithms and lower bounding. The last part covers indexing approaches for distance-based similarity models, where we discuss the fundamentals of spatial indexing, high-dimensional indexing, as well as metric and Ptolemaic indexing.
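The filter-and-refinement idea mentioned above can be sketched for a simple range query. The example is a minimal illustration under stated assumptions, not the tutorial's own material: objects are plain numeric tuples, the exact distance is Euclidean, and the cheap lower bound is the Euclidean distance restricted to the first few coordinates. The names `lower_bound`, `exact_distance`, `range_query` and the `prefix` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def exact_distance(a: Vector, b: Vector) -> float:
    """Full Euclidean distance (the expensive refinement step)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lower_bound(a: Vector, b: Vector, prefix: int = 2) -> float:
    """Euclidean distance on the first `prefix` coordinates only.

    Dropping non-negative squared terms can only shrink the sum, so this
    value never exceeds the exact distance.
    """
    return math.sqrt(
        sum((x - y) ** 2 for x, y in zip(a[:prefix], b[:prefix]))
    )


def range_query(database: Sequence[Vector], query: Vector,
                radius: float) -> Tuple[List[int], int]:
    """Return indices within `radius` of `query` and the number of
    exact distance computations that were needed."""
    results: List[int] = []
    refined = 0
    for idx, obj in enumerate(database):
        # Filter: if even the lower bound exceeds the radius, the exact
        # distance must as well, so the object is safely discarded.
        if lower_bound(obj, query) > radius:
            continue
        # Refine: only surviving candidates pay for the exact distance.
        refined += 1
        if exact_distance(obj, query) <= radius:
            results.append(idx)
    return results, refined


database = [(0, 0, 0, 0), (1, 0, 0, 0), (0, 0, 3, 0), (5, 5, 0, 0), (0, 1, 0, 1)]
hits, refined = range_query(database, query=(0, 0, 0, 0), radius=1.5)
```

Because the filter distance never overestimates the exact distance, no true result is lost; the tighter the lower bound, the fewer candidates reach the refinement step.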
https://theory.stanford.edu/~sergei/tutorial/
Morning
Sergei Vassilvitskii - Google
Grigory Yaroslavtsev - University of Pennsylvania
The MapReduce style of parallel processing has made certain operations nearly trivial to parallelize: Word Count is the canonical "Hello World" example. Still, parallelization of many problems, e.g., computing a good clustering or counting the number of triangles in a graph, requires effort, since straightforward approaches yield almost no speedups over a single machine implementation. This tutorial will cover recent results on algorithm design for MapReduce and other modern parallel architectures. We begin with an overview of the framework, and highlight the challenge of avoiding communication and computational bottlenecks. We then introduce a toolkit of algorithmic strategies for dealing with large datasets using MapReduce. The goal of most of these approaches is to reduce the data size (from petabytes and terabytes to gigabytes and megabytes), while preserving its structure relevant to the problem of interest. Sketching, composable coresets, and adaptive sampling all fall into this category of approaches. We then turn to specific applications to both showcase these techniques and highlight recently developed practical methods. Our initial focus is on clustering, whose many variants form the core of data analysis. We cover the classic clustering methods, such as k-means, as well as more modern approaches like correlation clustering and hierarchical clustering. We then turn to methods for graph analysis, building up our intuition with algorithms for graph connectivity and moving on to graph decompositions, matchings, spanning trees and subgraph counting.
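For readers unfamiliar with the Word Count example mentioned above, the following sketch simulates the map, shuffle, and reduce phases sequentially in plain Python. It does not use any MapReduce framework, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are assumptions chosen for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as a framework would
    do between the map and reduce phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run the three phases one after another on a single machine."""
    emitted = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(emitted)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
```

Each mapper call depends only on its own document and each reducer call only on its own key, which is what lets a real framework spread both phases across many machines.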