Accepted Papers

Scientific Track
Authors:

Ehsan Amid, Aristides Gionis, Antti Ukkonen

Affiliation(s):
Aalto University
Finnish Institute of Occupational Health

Abstract:
We consider the problem of clustering a given dataset into k clusters subject to an additional set of constraints on relative distance comparisons between the data items. The additional constraints are meant to reflect side information that is not expressed directly in the feature vectors. Relative comparisons can express structures at a finer level of detail than must-link (ML) and cannot-link (CL) constraints, which are commonly used for semi-supervised clustering. Relative comparisons are particularly useful in settings where giving an ML or a CL constraint is difficult because the granularity of the true clustering is unknown. Our main contribution is an efficient algorithm for learning a kernel matrix using the log-determinant divergence (a variant of the Bregman divergence) subject to a set of relative distance constraints. Given the learned kernel matrix, a clustering can be obtained by any suitable algorithm, such as kernel k-means. We show empirically that kernels found by our algorithm yield clusterings of higher quality than existing approaches that either use ML/CL constraints or a different means of implementing the supervision via relative comparisons.

Scientific Track
Authors:

Tinh Tran, Alex Aussem

Affiliation(s):
University Lyon 1

Abstract:
Covariate shift is a specific class of selection bias that arises when the marginal distributions of the input features X are different in the source and the target domains while the conditional distributions of the target Y given X are the same. A common technique to deal with this problem, called importance weighting, amounts to reweighting the training instances in order to make them resemble the test distribution. However, this usually comes at the expense of a reduction of the effective sample size. In this paper, we show analytically that, while the unweighted model is globally more biased than the weighted one, it may locally be less biased on low importance instances. In view of this result, we then discuss a manner to optimally combine the weighted and the unweighted models in order to improve the predictive performance in the target domain. We conduct a series of experiments on synthetic and real-world data to demonstrate the efficiency of this approach.
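For background, the importance-weighting technique this abstract builds on reweights each training instance by w(x) = p_target(x) / p_source(x) before fitting. The sketch below (Python, with scikit-learn) is illustrative only and is not the authors' combined weighted/unweighted estimator; it assumes the weights have already been estimated, e.g. by a density-ratio method, and also shows the effective-sample-size reduction the abstract mentions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weighted(X_train, y_train, importance_weights):
    """Reweight training instances so they resemble the target domain.
    The weights w(x) = p_target(x) / p_source(x) are assumed given."""
    model = LogisticRegression()
    model.fit(X_train, y_train, sample_weight=importance_weights)
    return model

def effective_sample_size(importance_weights):
    """Kish's effective sample size: reweighting shrinks it below n."""
    w = np.asarray(importance_weights)
    return w.sum() ** 2 / (w ** 2).sum()

w = np.array([0.2, 0.5, 1.0, 4.0])
print(effective_sample_size(w))  # ~1.9, far below n = 4
```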

Scientific Track
Authors:
Zhanxing Zhu, University of Edinburgh
Amos Storkey, University of Edinburgh

Abstract:
We consider a generic convex-concave saddle point problem with a separable structure, a form that covers a wide range of machine learning applications. Under this problem structure, we follow the framework of primal-dual updates for saddle point problems, and incorporate stochastic block coordinate descent with adaptive stepsizes into this framework. We theoretically show that our proposal of adaptive stepsizes potentially achieves a sharper linear convergence rate compared with the existing methods. Additionally, since we can select a "mini-batch" of block coordinates to update, our method is also amenable to parallel processing for large-scale data. We apply the proposed method to regularized empirical risk minimization and show that it performs comparably or, more often, better than state-of-the-art methods on both synthetic and real-world data sets.

Scientific Track
Authors:
Sebastian Wagner, Ludwig-Maximilians-University Munich
Max Zimmermann, University of Magdeburg
Eirini Ntoutsi, Ludwig-Maximilians-University Munich
Myra Spiliopoulou, University of Magdeburg

Abstract:
The long-term analysis of opinionated streams requires algorithms that predict the polarity of opinionated documents while adapting to different forms of concept drift: the class distribution may change, but also the vocabulary used by the document authors may change. One of the key properties of a stream classifier is adaptation to concept drifts and shifts; this is typically achieved through ageing of the data. Surprisingly, for one of the most popular classifiers, Multinomial Naive Bayes (MNB), no ageing has been considered thus far. MNB is particularly appropriate for opinionated streams, because it allows the seamless adjustment of word probabilities as new words appear for the first time. However, to adapt properly to drift, MNB must also be extended to take the age of documents and words into account. In this study, we incorporate ageing into the learning process of MNB by introducing the notion of fading for words, on the basis of the recency of the documents containing them. We propose two fading versions, gradual fading and aggressive fading, of which the latter discards old data at a faster pace. Our experiments with Twitter data show that the ageing-based MNBs outperform the standard accumulative MNB approach and manage to recover very fast in times of change. We experiment with different data granularities in the stream and different data ageing degrees, and we show how they "work together" towards adaptation to change.
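As background, fading can be grafted onto a multinomial Naive Bayes stream classifier by exponentially decaying all stored counts before each update. The sketch below is illustrative only and does not reproduce the paper's gradual/aggressive variants; the fading factor `lam` is a hypothetical parameter name.

```python
import math
from collections import defaultdict

class FadingMNB:
    """Multinomial Naive Bayes with exponentially faded counts.
    Illustrative sketch, not the paper's exact schemes. `lam` (< 1)
    controls forgetting; `alpha` is Laplace smoothing."""

    def __init__(self, lam=0.99, alpha=1.0):
        self.lam = lam
        self.alpha = alpha
        self.word_counts = defaultdict(lambda: defaultdict(float))
        self.class_counts = defaultdict(float)

    def update(self, words, label):
        # Age every stored count, then absorb the new document.
        for c in list(self.class_counts):
            self.class_counts[c] *= self.lam
            for w in self.word_counts[c]:
                self.word_counts[c][w] *= self.lam
        self.class_counts[label] += 1.0
        for w in words:
            self.word_counts[label][w] += 1.0

    def predict(self, words):
        vocab = {w for wc in self.word_counts.values() for w in wc}
        best, best_score = None, -math.inf
        for c, cc in self.class_counts.items():
            total = sum(self.word_counts[c].values())
            score = math.log(cc)  # faded class prior (unnormalized)
            for w in words:
                p = (self.word_counts[c].get(w, 0.0) + self.alpha) \
                    / (total + self.alpha * len(vocab))
                score += math.log(p)
            if score > best_score:
                best, best_score = c, score
        return best
```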

Scientific Track
Authors:
Amos Storkey, University of Edinburgh
Zhanxing Zhu, University of Edinburgh
Jinli Hu, University of Edinburgh

Abstract:
Trading in information markets, such as machine learning markets, has been shown to be an effective approach for aggregating the beliefs of different agents. In a machine learning context, aggregation commonly uses forms of linear opinion pools or log opinion pools. It is interesting to relate information market aggregation to the machine learning setting. In this paper we introduce a spectrum of compositional methods, Renyi divergence aggregators, that interpolate between log opinion pools and linear opinion pools. We show that these compositional methods are maximum entropy distributions for aggregating information from agents subject to individual biases, with the Renyi divergence parameter dependent on the bias. In the limit of no bias this reduces to the optimal limit of log opinion pools. We demonstrate this relationship practically on both simulated and real datasets. We then return to information markets and show that Renyi divergence aggregators are directly implemented by machine learning markets with isoelastic utilities, and so can result from autonomous, self-interested decision making by individuals contributing different predictors. The risk-averseness of the isoelastic utility directly relates to the Renyi divergence parameter, and hence encodes how much an agent believes (s)he may be subject to an individual bias that could affect the trading outcome: if an agent believes (s)he might be acting on significantly biased information, a more risk-averse isoelastic utility is warranted.
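The two pooling schemes between which the Renyi aggregators interpolate have simple closed forms: a linear opinion pool is a weighted arithmetic mean of the agents' distributions, while a log opinion pool is a renormalized weighted geometric mean. A minimal sketch of these two endpoints (the aggregators themselves are defined in the paper):

```python
import numpy as np

def linear_opinion_pool(probs, weights):
    """Weighted arithmetic mean; rows of `probs` are agents' distributions."""
    probs, weights = np.asarray(probs), np.asarray(weights)
    return weights @ probs

def log_opinion_pool(probs, weights):
    """Weighted geometric mean, renormalized to sum to one."""
    probs, weights = np.asarray(probs), np.asarray(weights)
    pooled = np.exp(weights @ np.log(probs))
    return pooled / pooled.sum()

# Two agents over three outcomes, equal weights:
p = [[0.7, 0.2, 0.1],
     [0.4, 0.4, 0.2]]
w = [0.5, 0.5]
print(linear_opinion_pool(p, w))  # [0.55 0.3  0.15]
print(log_opinion_pool(p, w))
```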

Scientific Track
Authors:
Jiwoong Im, University of Guelph
Ethan Buchman, University of Guelph
Graham Taylor, University of Guelph

Abstract:
Energy-based models are popular in machine learning due to the elegance of their formulation and their relationship to statistical physics. Among these, the Restricted Boltzmann Machine (RBM), and its staple training algorithm contrastive divergence (CD), have been the prototype for some recent advancements in the unsupervised training of deep neural networks. However, CD has limited theoretical motivation, and can in some cases produce undesirable behaviour. Here, we investigate the performance of Minimum Probability Flow (MPF) learning for training RBMs. Unlike CD, with its focus on approximating an intractable partition function via Gibbs sampling, MPF proposes a tractable, consistent objective function defined in terms of a Taylor expansion of the KL divergence with respect to sampling dynamics. We propose a more general form for the sampling dynamics in MPF, and explore the consequences of different choices for these dynamics for training RBMs. Experimental results show MPF outperforming CD for various RBM configurations.

Scientific Track
Authors:
Yuanli Pei, Oregon State University
Li-Ping Liu, Oregon State University
Xiaoli Fern, Oregon State University

Abstract:
Clustering can be improved with pairwise constraints that specify similarities between pairs of instances. However, randomly selecting constraints could lead to wasted labeling effort, or even degrade the clustering performance. Consequently, how to actively select effective pairwise constraints to improve clustering becomes an important problem, which is the focus of this paper. In this work, we introduce a Bayesian clustering model that learns from pairwise constraints. With this model, we present an active learning framework that iteratively selects the most informative pair of instances to query an oracle, and updates the model posterior based on the obtained pairwise constraints. We introduce two information-theoretic criteria for selecting informative pairs. One selects the pair with the most uncertainty, and the other chooses the pair that maximizes the marginal information gain about the clustering. Experiments on benchmark datasets demonstrate the effectiveness of the proposed method over state-of-the-art approaches.

Scientific Track
Authors:

Tom Diethe, Niall Twomey, Peter Flach

Affiliation(s):
University of Bristol

Abstract:
Typically, when analysing patterns of activity in a smart home environment, the daily patterns of activity are either ignored completely or summarised into a high-level "hour-of-day" feature that is then combined with sensor activities. However, when summarising the temporal nature of an activity into a coarse feature such as this, not only is information lost after discretisation, but also the strength of the periodicity of the action is ignored. We propose to model the temporal nature of activities using circular statistics, and in particular by performing Bayesian inference with Wrapped Normal (WN) and WN Mixture (WNM) models. We first demonstrate the accuracy of inference on toy data using both Gibbs sampling and Expectation Propagation (EP), and then show the results of the inference on publicly available smart-home data. Such models can be useful for analysis or prediction in their own right, or can be readily combined with larger models incorporating multiple modalities of sensor activity.
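The Wrapped Normal density itself is simply a Gaussian summed over integer wraps of the circle, p(θ) = Σ_k N(θ + 2πk; μ, σ²), which a truncated sum approximates well. A sketch of the density only; the paper's contribution (Bayesian inference via Gibbs sampling and EP, and the mixture model) is not shown.

```python
import numpy as np

def wrapped_normal_pdf(theta, mu, sigma, n_wraps=10):
    """Density of a Wrapped Normal on [0, 2*pi): the Gaussian density
    summed over integer wraps; a truncated sum suffices in practice."""
    ks = np.arange(-n_wraps, n_wraps + 1)
    z = (theta + 2 * np.pi * ks - mu) / sigma
    return np.sum(np.exp(-0.5 * z ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Example: density of an activity at 23:30 under a model centred at
# midnight -- the wrap-around is handled naturally.
hour_to_angle = lambda h: 2 * np.pi * h / 24.0
print(wrapped_normal_pdf(hour_to_angle(23.5), mu=hour_to_angle(0.0), sigma=0.5))
```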

Scientific Track
Authors:
Nipa Chowdhury, UNSW
Xiongcai Cai, UNSW
Cheng Luo, UNSW

Abstract:
Personalised recommender systems are widely used for information filtering in information retrieval, where matrix factorisation (MF) has become popular as a model-based approach to personalised recommendation. Classical MF methods, which directly approximate low-rank factor matrices by minimising some rating prediction criterion, do not achieve satisfactory performance for the task of top-N recommendation. In this paper, we propose a novel MF method, namely BoostMF, which formulates factorisation as a learning problem and integrates boosting into factorisation. Rather than using boosting as a wrapper, BoostMF directly learns latent factors that are optimised towards top-N recommendation. The proposed method is evaluated against a set of state-of-the-art methods on three popular public benchmark datasets. The experimental results demonstrate that the proposed method achieves significant improvement over these baseline methods for the task of top-N recommendation.

Scientific Track
Authors:
Debakar Shamanta, UTEP
Sheikh Motahar Naim, University of Texas at El Paso
Parang Saraf, VT
Naren Ramakrishnan, Virginia Tech
M. Shahriar Hossain, UTEP

Abstract:
Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representations of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representations of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and for the generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.

Scientific Track
Authors:
Markus Ring, HS Coburg
Florian Otto, HS Coburg
Martin Becker, University of Würzburg
Thomas Niebler, University of Würzburg
Dieter Landes, HS Coburg
Andreas Hotho, University of Kassel

Abstract:
A distance measure between objects is a key requirement for many data mining tasks like clustering, classification or outlier detection. However, for objects described by categorical attributes, defining meaningful distance measures is a challenging task, since the values within such attributes have no inherent order, especially without additional domain knowledge. In this paper, we propose an unsupervised distance measure for objects with categorical attributes based on the idea that categorical attribute values are similar if they appear with similar value distributions on correlated context attributes. Thus, the distance measure is automatically derived from the given data set. We compare our new distance measure to existing categorical distance measures and evaluate it on different data sets from the UCI Machine Learning Repository. The experiments show that our distance measure is recommendable, since it achieves similar or better results in a more robust way than previous approaches.

Scientific Track
Authors:
Mathieu Blondel, NTT Communication Science Laboratories
Akinori Fujino, NTT Communication Science Laboratories
Naonori Ueda, NTT Communication Science Laboratories

Abstract:
Factorization machines are a generic framework which allows one to mimic most factorization models simply by feature engineering. In this way, they combine the high predictive accuracy of factorization models with the flexibility of feature engineering. Unfortunately, factorization machines involve a non-convex optimization problem. Thus, we can only obtain a local solution, the quality of which depends on initialization. In this paper, we propose a convex formulation of factorization machines based on the nuclear norm. Our formulation imposes fewer restrictions on the learned model and is thus more general than the original formulation. To solve the corresponding optimization problem, we present an efficient globally-convergent two-block coordinate descent algorithm. Empirically, we demonstrate that our approach achieves comparable or better predictive accuracy than the original factorization machines on 4 recommendation tasks and scales to datasets with 10 million samples.
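For context, the original (non-convex) factorization machine predicts with pairwise interactions factorized through rank-k vectors, ŷ(x) = w0 + ⟨w, x⟩ + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j, computable in O(nk) time via Rendle's identity. A sketch of that baseline model only; the paper's convex variant replaces the explicit factors with a nuclear-norm-regularized interaction matrix.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction:
        y(x) = w0 + <w, x> + sum_{i<j} <V_i, V_j> x_i x_j
    using the O(n*k) identity
        sum_{i<j} <V_i,V_j> x_i x_j
          = 0.5 * sum_f [ (sum_i V_{if} x_i)^2 - sum_i V_{if}^2 x_i^2 ]."""
    linear = w0 + x @ w
    s = V.T @ x                                    # shape (k,)
    pairwise = 0.5 * (s @ s - ((V ** 2).T @ (x ** 2)).sum())
    return linear + pairwise

# Tiny example: 4 features, rank-2 factors.
rng = np.random.default_rng(0)
x = rng.random(4)
w0, w, V = 0.1, rng.random(4), rng.random((4, 2))
print(fm_predict(x, w0, w, V))
```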

Scientific Track
Authors:
Vikas Raykar, IBM Research
Amrita Saha, IBM Research

Abstract:
A conventional textbook prescription for building good predictive models is to split the data into three parts: training set (for model fitting), validation set (for model selection), and test set (for final model assessment). Predictive models can potentially evolve over time as developers improve their performance either by acquiring new data or improving the existing model. The main contribution of this paper is to discuss the problems encountered in such dynamic model building and updating scenarios, and to propose workflows to manage the allocation of newly acquired data into the different sets. Specifically, we propose three different workflows (parallel dump, serial waterfall, and hybrid) for allocating new data into the existing training, validation, and test splits. Particular emphasis is placed on avoiding the bias due to repeated use of the existing validation or test set.

Scientific Track
Authors:

Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, Yoshua Bengio

Affiliation(s):
Université de Montréal

Abstract:
Back-propagation has been the workhorse of recent successes of deep learning but it relies on infinitesimal effects (partial derivatives) in order to perform credit assignment. This could become a serious issue as one considers deeper and more non-linear functions, e.g., consider the extreme case of nonlinearity where the relation between parameters and cost is actually discrete. Inspired by the biological implausibility of back-propagation, a few approaches have been proposed in the past that could play a similar credit assignment role. In this spirit, we explore a novel approach to credit assignment in deep networks that we call target propagation. The main idea is to compute targets rather than gradients, at each layer. Like gradients, they are propagated backwards. In a way that is related but different from previously proposed proxies for back-propagation which rely on a backwards network with symmetric weights, target propagation relies on auto-encoders at each layer. Unlike back-propagation, it can be applied even when units exchange stochastic bits rather than real numbers. We show that a linear correction for the imperfectness of the auto-encoders, called difference target propagation, is very effective to make target propagation actually work, leading to results comparable to back-propagation for deep networks with discrete and continuous units and denoising auto-encoders, and achieving state of the art for stochastic networks.
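The linear correction mentioned in the abstract has a compact form: with a learned approximate inverse g of a layer's forward mapping, the target passed to the layer below is ĥ_below = h_below − g(h_above) + g(target_above). A sketch of just this formula; per-layer auto-encoder training and the layer-local losses are omitted, and the toy inverse below is exact only for this example.

```python
import numpy as np

def difference_target(h_below, h_above, target_above, g_inverse):
    """Difference target propagation correction: propagate a target
    down through an (imperfect) learned inverse g, cancelling its
    reconstruction error at the current activation."""
    return h_below - g_inverse(h_above) + g_inverse(target_above)

# Toy check with an exact inverse: the corrected target below then
# coincides with the inverse of the target above.
f = lambda h: np.tanh(h)
g = lambda h: np.arctanh(np.clip(h, -0.999, 0.999))
h0 = np.array([0.2, -0.4])
h1 = f(h0)
t1 = h1 + 0.05                     # a target nudged by the layer above
print(difference_target(h0, h1, t1, g))
```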

Scientific Track
Authors:
Rina Okada, University of Tsukuba
Kazuto Fukuchi, University of Tsukuba
Jun Sakuma, University of Tsukuba

Abstract:
This paper presents an investigation of differentially private analysis of distance-based outliers. Outlier detection aims to identify instances that are apparently distant from other instances. Meanwhile, the objective of differential privacy is to conceal the presence (or absence) of any particular instance. Outlier detection and privacy protection are therefore intrinsically conflicting tasks. In this paper, we present differentially private queries for counting outliers that appear in a given subspace, instead of reporting the outliers detected. Our analysis of the global sensitivity of outlier counts reveals that regular global sensitivity-based methods can make the outputs too noisy, particularly when the dimensionality of the given subspace is high. Noting that the counts of outliers are typically expected to be small compared to the number of data points, we introduce a mechanism based on the smooth upper bound of the local sensitivity. This study is the first attempt to ensure differential privacy for distance-based outlier analysis. The experimentally obtained results show that our method achieves better utility than global sensitivity-based methods do.
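For background, the global-sensitivity baseline the paper argues against is the standard Laplace mechanism, which releases a count c as c + Lap(Δ/ε), where Δ is the global sensitivity of the count. A sketch of that baseline only; the paper instead calibrates noise to a smooth upper bound of the local sensitivity.

```python
import numpy as np

def private_count(true_count, epsilon, sensitivity):
    """Laplace mechanism: release count + Lap(sensitivity / epsilon).
    This is the global-sensitivity baseline that the abstract says
    becomes too noisy for outlier counts."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Removing one record can flip the outlier status of several other
# points at once, so the global sensitivity of an outlier count is
# generally larger than 1 -- hence the large noise scale.
print(private_count(true_count=7, epsilon=0.5, sensitivity=3))
```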

Scientific Track
Authors:
Shuyang Lin, UIC
Qingbo Hu, UIC
Jingyuan Zhang, UIC
Philip Yu, UIC

Abstract:
Recently, user influence in social networks has been studied extensively. Many applications related to social influence depend on quantifying influence and finding the most influential users of a social network. Most existing work studies the global influence of users, i.e., the aggregated influence that a user has on the entire network. It is often overlooked that users may be significantly more influential to some audience groups than to others. In this paper, we propose AudClus, a method to detect audience groups and identify group-specific influencers simultaneously. With extensive experiments on real data, we show that AudClus is effective in both the task of detecting audience groups and the task of identifying influencers of audience groups. We further show that AudClus enables insightful observations on the relation between audience groups and influencers. The proposed method leads to various applications in areas such as viral marketing, expert finding, and data visualization.

Scientific Track
Authors:
Junting Ye, Stony Brook University
Leman Akoglu, Stony Brook University

Abstract:
Online reviews are an important source for consumers to evaluate products/services on the Internet (e.g., Amazon, Yelp). However, more and more fraudulent reviewers write fake reviews to mislead users. To maximize their impact and share effort, many spam attacks are organized as campaigns by a group of spammers. In this paper, we propose a new two-step method to discover spammer groups and their targeted products. First, we introduce NFS (Network Footprint Score), a new measure that quantifies the likelihood of products being spam campaign targets. Second, we carefully devise GroupStrainer to cluster spammers on a 2-hop subgraph induced by top-ranking products. We demonstrate the efficiency and effectiveness of our approach on both synthetic and real-world datasets from two different domains with millions of products and reviewers. Moreover, we discover interesting strategies that spammers employ through case studies of our detected groups.

Scientific Track
Authors:

Rana Haber, Anand Rangarajan, Adrian Peter

Affiliation(s):
Florida Institute of Technology
University of Florida

Abstract:
The modus operandi for machine learning is to represent data as feature vectors and then proceed with training algorithms that seek to optimally partition the feature space S ⊂ R^n into labeled regions. This holds true even when the original data are functional in nature, i.e. curves or surfaces that are inherently varying over a continuum such as time or space. Functional data are often reduced to summary statistics, locally-sensitive characteristics, and global signatures with the objective of building comprehensive feature vectors that uniquely characterize each function. The present work directly addresses representational issues of functional data for supervised learning. We propose a novel discriminative interpolation framework wherein functional data in the same class are adaptively reconstructed to be more similar to each other, while simultaneously repelling nearest-neighbor functional data in other classes. Akin to other recent nearest-neighbor metric learning paradigms like stochastic k-neighborhood selection and large margin nearest neighbors, our technique uses class-specific representations which gerrymander similar functional data in an appropriate parameter space. Experimental validation on several time series data sets establishes the proposed discriminative interpolation framework as competitive or better in comparison to recent state-of-the-art techniques which continue to rely on the standard feature vector representation.

Scientific Track
Authors:
David Huang, The University of Auckland
Yun Sing Koh, University of Auckland
Gillian Dobbie, The University of Auckland
Albert Bifet, Huawei Noah's Ark Lab

Abstract:
Current methods in data streams that detect concept drifts in the underlying distribution of data look at the distribution difference using statistical measures based on mean and variance. Existing methods are unable to proactively approximate the probability of a concept drift occurring and predict future drift points. We extend the current drift detection design by proposing the use of historical drift trends to estimate the probability of expecting a drift at different points across the stream, which we term the expected drift probability. We offer empirical evidence that applying our expected drift probability with the state-of-the-art drift detector, ADWIN, we can improve the detection performance of ADWIN by significantly reducing the false positive rate. To the best of our knowledge, this is the first work that investigates this idea. We also show that our overall concept can be easily incorporated back onto incremental classifiers such as VFDT and demonstrate that the performance of the classifier is further improved.

Scientific Track
Authors:
Dirk Schäfer, University of Marburg
Eyke Huellermeier, University of Paderborn

Abstract:
Label ranking is a specific type of preference learning problem, namely the problem of learning a model that maps instances to rankings over a finite set of predefined alternatives. These alternatives are identified by their name or label while not being characterized in terms of any properties or features that could be potentially useful for learning. In this paper, we consider a generalization of the label ranking problem that we call dyad ranking. In dyad ranking, not only the instances but also the alternatives are represented in terms of attributes. For learning in the setting of dyad ranking, we propose an extension of an existing label ranking method based on the Plackett-Luce model, a statistical model for rank data. Moreover, we present first experimental results confirming the usefulness of the additional information provided by the feature description of alternatives.
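The Plackett-Luce model underlying the proposed extension assigns each ranking a probability as a product of sequential choice probabilities: at every position, the next item is chosen with probability proportional to its (positive) skill parameter among the items still remaining. A sketch of the base model's likelihood only; the dyad-ranking extension makes these skills functions of joint instance/alternative features.

```python
def plackett_luce_prob(ranking, skills):
    """Probability of `ranking` (item ids, best first) under a
    Plackett-Luce model with positive skill parameters."""
    remaining = list(ranking)
    prob = 1.0
    for item in ranking:
        denom = sum(skills[j] for j in remaining)
        prob *= skills[item] / denom   # sequential choice probability
        remaining.remove(item)
    return prob

skills = {"a": 3.0, "b": 1.0, "c": 1.0}
print(plackett_luce_prob(["a", "b", "c"], skills))  # (3/5) * (1/2) * 1 = 0.3
```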

Scientific Track
Authors:
Asma Dachraoui, AgroParisTech & EDF
Alexis Bondu, EDF R&D
Antoine Cornuéjols, AgroParisTech

Abstract:
Classifying time series as early as possible is a much sought-after goal. Indeed, in many application domains, the earlier the decision, the more rewarding it can be. Yet, often, gathering more information allows one to make a better decision. The optimization of this time vs. accuracy trade-off must generally be solved online and is a complex problem. This paper presents a formal criterion that expresses this trade-off in all generality, together with a generic sequential meta-algorithm to solve it. This meta-algorithm is interesting in two ways. First, it pinpoints where choices can (and have to) be made to obtain a computable algorithm. As a result, a wealth of algorithmic solutions can be found. Second, it seeks online the earliest time in the future where a minimization of the criterion can be expected. It thus goes beyond the classical approaches that myopically decide at each time step whether to make a decision or to postpone the call one more time step. After this general setting has been expounded, we study one simple instantiation of the meta-algorithm, and we show the results obtained on synthetic time series data sets chosen for their ability to test the robustness and properties of the technique. The general approach is vindicated by the experimental results, which allow us to point to promising perspectives.

Scientific Track
Authors:
Hsun-Ping Hsieh, National Taiwan University
Cheng-Te Li, Academia Sinica
Shou-De Lin, National Taiwan University

Abstract:
Acquiring knowledge about the volume of customers at places and times of interest has several benefits, such as determining the locations of new retail stores and planning advertising strategies. This paper aims to estimate the number of potential customers of arbitrary query locations and any time of interest in modern urban areas. Our idea is to consider existing established stores as a kind of sensors, because the nearby human activities of the retail stores characterize the geographical properties, mobility patterns, and social behaviors of the target customers. To tackle the task based on store sensors, we develop a method called Potential Customer Estimator (PCE), which models the spatial and temporal correlation between existing stores and query locations using geographical, mobility, and location-based social network features. Experiments conducted on NYC Foursquare and Gowalla data, with three popular retail stores, Starbucks, McDonald's, and Dunkin' Donuts, exhibit superior results over state-of-the-art approaches.

Scientific Track
Authors:

Qingming Tang, Harry Yang, Jian Peng, Jinbo Xu

Affiliation(s):
Toyota Technological Institute
UIUC

Abstract:
This paper considers the problem of estimating multiple related Gaussian graphical models from a p-dimensional dataset consisting of different classes. Our work is based upon the formulation of this problem as group graphical lasso. This paper proposes a novel hybrid covariance thresholding algorithm that can effectively identify zero entries in the precision matrices and split a large joint graphical lasso problem into small subproblems. Our hybrid covariance thresholding method is superior to existing uniform thresholding methods in that our method can split the precision matrix of each individual class using different partition schemes and thus split group graphical lasso into much smaller subproblems, each of which can be solved very fast. In addition, this paper establishes necessary and sufficient conditions for our hybrid covariance thresholding algorithm. The superior performance of our thresholding method is thoroughly analyzed and illustrated by a few experiments on simulated data and real gene expression data.

Scientific Track
Authors:
Aleksey Buzmakov, LORIA (INRIA-CNRS-U. Lorraine)
Sergei Kuznetsov, Higher School of Economics
Amedeo Napoli, Inria Nancy Grand Est / LORIA

Abstract:
In pattern mining, the main challenge is the exponential explosion of the set of patterns. Typically, to solve this problem, a constraint for pattern selection is introduced. One of the first constraints proposed in pattern mining is the support (frequency) of a pattern in a dataset. Frequency is an anti-monotonic function, i.e., given an infrequent pattern, all its superpatterns are not frequent. However, many other constraints for pattern selection are not (anti-)monotonic, which makes it difficult to generate patterns satisfying them. In this paper we introduce the notion of projection-antimonotonicity and the theta-Sofia algorithm, which allows efficient generation of the best patterns for some nonmonotonic constraints. We consider stability and delta-measure, which are nonmonotonic constraints, and apply them to interval tuple datasets. In the experiments, we compute the best interval tuple patterns w.r.t. these measures and show the advantage of our approach over postfiltering approaches.

Scientific Track
Authors:

Chao Zhang, Shan Jiang, Yucheng Chen, Yidan Sun, Jiawei Han

Affiliation(s):
UIUC

Abstract:
Random walk with restart (RWR) is widely recognized as one of the most important node proximity measures for graphs, as it captures the holistic graph structure and is robust to noise in the graph. In this paper, we study a novel query based on the RWR measure, called the inbound top-k (Ink) query. Given a query node q and a number k, the Ink query aims at retrieving k nodes in the graph that have the largest weighted RWR scores to q. Ink queries can be highly useful for various applications such as traffic scheduling, disease treatment, and targeted advertising. Nevertheless, none of the existing RWR computation techniques can accurately and efficiently process the Ink query in large graphs. We propose two algorithms, namely Squeeze and Ripple, both of which can accurately answer the Ink query in a fast and incremental manner. To identify the top-k nodes, Squeeze iteratively performs matrix-vector multiplication and estimates the lower and upper bounds for all the nodes in the graph. Ripple employs a more aggressive strategy by only estimating the RWR scores for the nodes falling in the vicinity of q, which makes it even more efficient than Squeeze. The nodes outside the vicinity do not need to be evaluated because their RWR scores are propagated from the boundary of the vicinity and thus upper bounded. Ripple incrementally expands the vicinity until the top-k result set can be obtained. Our extensive experiments on real-life graph data sets show that Ink queries can retrieve interesting results, and the proposed algorithms are orders of magnitude faster than state-of-the-art methods.
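For context, plain RWR scores are the fixed point of r = (1 − c) P^T r + c e_q and can be computed by power iteration over the whole graph; Squeeze and Ripple avoid exactly this full computation by maintaining bounds. A sketch of the basic scheme only:

```python
import numpy as np

def rwr_scores(adj, q, restart=0.15, tol=1e-9, max_iter=1000):
    """Random walk with restart from node q by power iteration.
    Baseline scheme only; the paper's Squeeze and Ripple answer
    inbound top-k queries without iterating over the whole graph."""
    n = adj.shape[0]
    P = adj / adj.sum(axis=1, keepdims=True)  # row-stochastic transitions
    e = np.zeros(n)
    e[q] = 1.0
    r = e.copy()
    for _ in range(max_iter):
        r_new = (1 - restart) * P.T @ r + restart * e
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# 4-node toy graph (undirected path 0-1-2-3):
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
print(rwr_scores(A, q=0))
```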

Scientific Track
Authors:
Paul Mineiro, Microsoft CISL
Nikos Karampatziakis, Microsoft CISL

Abstract:
Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank-constrained estimation and low-dimensional label embeddings that uncovers a fast label embedding algorithm which works in both the multiclass and multilabel settings. The result is a randomized algorithm whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state-of-the-art results.

Scientific Track
Authors:
Sebastian Pölsterl, Technische Universität München
Nassir Navab, Technische Universität München
Amin Katouzian, Technische Universität München

Abstract:
Survival analysis is a commonly used technique to identify important predictors of adverse events and develop guidelines for patient treatment in medical research. When applied to large amounts of patient data, efficient optimization routines become a necessity. We propose efficient training algorithms for three kinds of linear survival support vector machines: 1) ranking-based, 2) regression-based, and 3) combined ranking and regression. We perform optimization in the primal using truncated Newton optimization and use order statistic trees to lower the computational costs of training. We employ the same optimization technique and extend it to non-linear models as well. Our results demonstrate the superiority of our proposed optimization scheme over existing training algorithms, which fail due to their inherently high time and space complexities when applied to large datasets. We validate the proposed survival models on 6 real-world datasets, and show that pure ranking-based approaches outperform regression and hybrid models.

Scientific Track
Authors:
Matt Revelle, George Mason University
Carlotta Domeniconi, George Mason University
Mack Sweeney, George Mason University
Aditya Johri, George Mason University

Abstract:
Community detection in networks is a broad problem with many proposed solutions. Existing methods frequently make use of edge density and node attributes; however, the methods ultimately have different definitions of community and build strong assumptions about community features into their models. We propose a new method for community detection, which estimates both per-community feature distributions (topics) and per-node community membership. Communities are modeled as connected subgraphs with nodes sharing similar attributes. Nodes may join multiple communities and share common attributes with each. Communities have an associated probability distribution over attributes and node attributes are modeled as draws from a mixture distribution. We make two basic assumptions about community structure: communities are densely connected and have a small network diameter. These assumptions inform the estimation of community topics and membership assignments without being too prescriptive. We present competitive results against state-of-the-art methods for finding communities in networks constructed from NSF awards, the DBLP repository, and the Scratch online community.

Scientific Track
Authors:
Vinay Jethava, ETH Zurich
Niko Beerenwinkel, ETH Zurich

Abstract:
This paper considers the problem of finding large dense subgraphs in relational graphs, i.e., a set of graphs which share a common vertex set. We present an approximation algorithm for finding the densest common subgraph in a relational graph set, based on an extension of Charikar's linear programming based approach for finding the densest subgraph in a single graph. We also present a simple greedy heuristic which can be implemented efficiently for analysis of larger graphs. We give graph-dependent bounds on the quality of the solutions returned by our methods. Lastly, we show by empirical evaluation on several benchmark datasets that our method outperforms existing approaches.
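Charikar's original single-graph work also includes a simple greedy 2-approximation that repeatedly peels a minimum-degree vertex; it is a useful reference point for the relational setting studied here. A sketch of that single-graph baseline only, not the paper's LP extension:

```python
def densest_subgraph_greedy(adj):
    """Charikar's greedy peeling for the densest subgraph of a single
    graph (2-approximation): repeatedly remove a minimum-degree vertex
    and keep the densest intermediate subgraph. `adj` maps each vertex
    to a set of neighbours."""
    nodes = set(adj)
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    deg = {v: len(adj[v]) for v in nodes}
    best, best_density = set(nodes), edges / len(nodes)
    while len(nodes) > 1:
        v = min(nodes, key=deg.get)          # peel a min-degree vertex
        nodes.remove(v)
        edges -= deg[v]
        for u in adj[v]:
            if u in nodes:
                deg[u] -= 1
        density = edges / len(nodes)
        if density > best_density:
            best, best_density = set(nodes), density
    return best, best_density

# K4 with a pendant vertex: the densest subgraph is the K4 itself.
graph = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
print(densest_subgraph_greedy(graph))  # ({0, 1, 2, 3}, 1.5)
```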

Scientific Track
Authors:
Ayan Acharya, University of Texas at Austin
Dean Teffer, Applied Research Laboratories
Jette Henderson, Applied Research Laboratories
Marcus Tyler, Applied Research Laboratories
Mingyuan Zhou, University of Texas at Austin
Joydeep Ghosh, University of Texas at Austin

Abstract:
Developing models to discover, analyze, and predict clusters within networked entities is an area of active and diverse research. However, many of the existing approaches do not take into consideration pertinent auxiliary information. This paper introduces Joint Gamma Process Poisson Factorization (J-GPPF) to jointly model network and side-information. J-GPPF naturally fits sparse networks, accommodates separately clustered side information in a principled way, and effectively addresses the computational challenges of analyzing large networks. Evaluated with hold-out link prediction performance on sparse networks (both synthetic and real-world) with side information, J-GPPF is shown to clearly outperform algorithms that only model the network adjacency matrix.

Scientific Track
Authors:

Karim Abou-Moustafa, Dale Schuurmans

Affiliation(s):
University of Alberta

Abstract:
We are interested in the following questions. Given a finite data set S, with neither labels nor side information, and an unsupervised learning algorithm A, can the generalization of A be assessed on S? Similarly, given two unsupervised learning algorithms, A1 and A2, for the same learning task, can one assess whether one will generalize "better" on future data drawn from the same source as S? In this paper, we develop a general approach to answering these questions in a reliable and efficient manner using mild assumptions on A. We first propose a concrete generalization criterion for unsupervised learning that is analogous to prediction error in supervised learning. Then, we develop a computationally efficient procedure that realizes the generalization criterion on finite data sets, and propose an extension for comparing the generalization of two algorithms on the same data set. We validate the overall framework on algorithms for clustering and dimensionality reduction (linear and nonlinear).

Scientific Track
Authors:
Pauli Miettinen, Max-Planck-Institut für Informatik

Abstract:
Matrix factorizations are a popular tool to mine regularities from data. There are many ways to interpret the factorizations, but one particularly suited for data mining utilizes the fact that a matrix product can be interpreted as a sum of rank-1 matrices. The factorization of a matrix then becomes the task of finding a small number of rank-1 matrices whose sum is a good representation of the original matrix. Seen this way, it becomes obvious that many problems in data mining can be expressed as matrix factorizations with correct definitions of what a rank-1 matrix and a sum of rank-1 matrices mean. This paper develops a unified theory, based on generalized outer product operators, that encompasses many pattern set mining tasks. The focus is on the computational aspects of the theory, and on studying the computational complexity and approximability of many problems related to generalized matrix factorizations. The results immediately apply to a large number of data mining problems, and hopefully will allow future results and algorithms to be generalized as well.

Scientific Track
Authors:
Mohadeseh Ganji, University of Melbourne
Abbas Seifi, Amirkabir University of Technology
Hosein Alizadeh, Iran University of Science and Technology
James Bailey, University of Melbourne
Peter Stuckey, University of Melbourne

Abstract:
Detecting the underlying community structure of networks is an important problem in complex network analysis. Modularity is a well-known quality function, introduced by Newman, that measures how vertices in a community share more edges than what would be expected in a randomized network. However, this limited view on vertex similarity leads to limits in what can be resolved by modularity. To overcome these limitations, we propose a generalized modularity measure called GM, which has a more sophisticated interpretation of vertex similarity. In particular, GM also takes into account the number of longer paths between vertices, compared to what would be expected in a randomized network. We also introduce a unified version of GM which detects communities of unipartite and (near-)bipartite networks without knowing the structure type in advance. Experiments on different synthetic and real data sets demonstrate that GM performs strongly in comparison to several existing approaches, particularly for small-world networks.
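Newman's modularity, which GM generalizes, compares within-community edges to the expectation under a degree-preserving random graph: Q = (1/2m) Σ_ij (A_ij − k_i k_j / 2m) δ(c_i, c_j). A sketch of this standard measure only; GM additionally credits longer paths between vertices.

```python
import numpy as np

def modularity(A, communities):
    """Newman's modularity
        Q = (1/2m) * sum_ij (A_ij - k_i*k_j/(2m)) * delta(c_i, c_j)
    for an undirected adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                      # vertex degrees
    two_m = A.sum()                        # equals 2m
    c = np.asarray(communities)
    same = (c[:, None] == c[None, :])      # delta(c_i, c_j)
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

# Two triangles joined by a single edge, split into their natural parts:
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
print(modularity(A, [0, 0, 0, 1, 1, 1]))   # 5/14 ~ 0.357
```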

Scientific Track
Authors:
Benjamin Fish, UIC
Rajmonda Caceres, MIT Lincoln Laboratory

Abstract:
Oversampling is a common characteristic of data representing dynamic networks. It introduces noise into representations of dynamic networks, but there has been little work so far to compensate for it. Oversampling can affect the quality of many important algorithmic problems on dynamic networks, including link prediction. Link prediction asks to predict edges that will be added to the network given previous snapshots of the network. We show that not only does oversampling affect the quality of link prediction, but that we can use link prediction to recover from the effects of oversampling. We also introduce a novel generative model of noise in dynamic networks that represents oversampling. We demonstrate the results of our approach on both synthetic and real-world data.

Scientific Track
Authors:
Yinjie Huang, University of Central Florida
Michael Georgiopoulos, University of Central Florida
Georgios Anagnostopoulos, Florida Institute of Technology

Abstract:
In this paper we introduce a novel hash learning framework that has two main distinguishing features when compared to past approaches. First, it utilizes codewords in the Hamming space as ancillary means to accomplish its hash learning task. These codewords, which are inferred from the data, attempt to capture grouping aspects of the data's hash codes. Second, and more importantly, the same framework is capable of addressing supervised, unsupervised and even semi-supervised hash learning tasks. A series of comparative experiments focused on content-based image retrieval highlights its performance advantages.

Scientific Track
Authors:
Xiao Bian, North Carolina State University
Xia Ning, IUPUI
Guofei Jiang, NEC Laboratories America

Abstract:
Sparse coding plays a key role in high-dimensional data analysis. One critical challenge of sparse coding is to design a dictionary that is both adaptive to the training data and generalizable to data of the same type. In this paper, we propose a novel dictionary learning algorithm to build an adaptive dictionary regularized by an a priori over-complete dictionary. This leads to a hierarchical sparse structure on the a priori dictionary over the learned dictionary, and a hierarchical sparse structure on the learned dictionary over the data, respectively. We demonstrate that this approach reduces overfitting and enhances the generalizability of the learned dictionary. Moreover, the learned dictionary is optimized to adapt to the given data and results in a more compact dictionary and a more robust sparse representation. We apply the hierarchical sparse dictionary learning approach to both synthetic data and real-world high-dimensional time series data, where the time series data exhibit high heterogeneity in their nature, e.g., continuous vs. discrete, smooth vs. non-smooth, etc. The experimental results on a real dataset from a chemical plant process monitoring system demonstrate that the proposed algorithm can successfully characterize the heterogeneity of the given data and leads to a better and more robust dictionary.

Scientific Track
Authors:

Anveshi Charuvaka, Huzefa Rangwala

Affiliation(s):
George Mason University

Abstract:
Hierarchical Classification (HC) is an important problem with a wide range of applications in domains such as music genre classification, protein function classification and document classification. Although several innovative classification methods have been proposed to address HC, most of them are not scalable to web-scale problems. While simple methods such as top-down "pachinko"-style classification and flat classification scale well, they either have poor classification performance or do not effectively use the hierarchical information. Current methods that incorporate hierarchical information in a principled manner are often computationally expensive and unable to scale to large datasets. In the current work, we adopt a cost-sensitive classification approach to the hierarchical classification problem by defining misclassification costs based on the hierarchy. This approach effectively decouples the models for the various classes, allowing us to efficiently train effective models for large hierarchies in a distributed fashion.

Scientific Track
Authors:
Koh Takeuchi, NTT
Yoshinobu Kawahara, Osaka University
Tomoharu Iwata, NTT

Abstract:
We often encounter situations in supervised learning where there exist possibly overlapping groups that consist of more than two parameters. For example, we might work on parameters that correspond to words expressing the same meaning, music pieces in the same genre, and books released in the same year. Based on such auxiliary information, we could suppose that parameters in a group have similar roles in a problem and similar values. In this paper, we propose the Higher Order Fused (HOF) regularization that can incorporate smoothness among parameters with group structures as prior knowledge in supervised learning. We define the HOF penalty as the Lovász extension of a submodular higher-order potential function, which encourages parameters in a group to take similar estimated values when used as a regularizer. Moreover, we develop an efficient network flow algorithm for calculating the proximity operator for the regularized problem. We investigate the empirical performance of the proposed algorithm by using synthetic and real-world data.

Scientific Track
Authors:
Nicolas Schilling, University of Hildesheim
Martin Wistuba, University of Hildesheim
Lucas Drumond, University of Hildesheim
Lars Schmidt-Thieme, University of Hildesheim

Abstract:
In machine learning, hyperparameter optimization is a challenging task that is usually approached by experienced practitioners or in a computationally expensive brute-force manner such as grid search. Therefore, recent research proposes to use observed hyperparameter performance on already solved problems (i.e. data sets) in order to speed up the search for promising hyperparameter configurations in the sequential model-based optimization framework. In this paper, we propose multilayer perceptrons as surrogate models, as they are able to model highly nonlinear hyperparameter response surfaces. However, since interactions of hyperparameters, data sets and metafeatures are only implicitly learned in the subsequent layers, we improve the performance of multilayer perceptrons by means of an explicit factorization of the interaction weights, and call the resulting model a factorized multilayer perceptron. Additionally, we evaluate different ways of obtaining predictive uncertainty, which is a key ingredient for a decent tradeoff between exploration and exploitation. Our experimental results on two public meta data sets demonstrate the efficiency of our approach compared to a variety of published baselines. For reproduction purposes, we make our data sets and all the program code publicly available on our supplementary webpage.
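The surrounding SMBO framework referred to above is a simple loop: fit a surrogate on the configurations evaluated so far, maximize an acquisition function such as expected improvement, evaluate the chosen configuration, and repeat. The sketch below shows that generic loop; the `surrogate` callable (returning a predictive mean and uncertainty per candidate) is a hypothetical stand-in for the paper's factorized multilayer perceptron.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best):
    """EI acquisition for minimization, given surrogate mean/std arrays."""
    sigma = np.maximum(sigma, 1e-12)
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def smbo(objective, candidates, surrogate, n_init=3, n_iter=20):
    """Generic SMBO loop. `surrogate(X_obs, y_obs, X_cand)` is a
    hypothetical callable returning (mean, std) over candidates.
    Bare-bones sketch: already-evaluated points may be re-selected."""
    rng = np.random.default_rng(0)
    X = [candidates[i] for i in rng.choice(len(candidates), n_init, replace=False)]
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        mu, sigma = surrogate(np.array(X), np.array(y), np.array(candidates))
        best_idx = int(np.argmax(expected_improvement(mu, sigma, min(y))))
        X.append(candidates[best_idx])
        y.append(objective(candidates[best_idx]))
    return X[int(np.argmin(y))], min(y)
```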

Scientific Track
Authors:
Martin Wistuba, University of Hildesheim
Nicolas Schilling, University of Hildesheim
Lars Schmidt-Thieme, University of Hildesheim

Abstract:
The optimization of hyperparameters is often done manually or exhaustively, but recent work has shown that automatic methods can optimize hyperparameters faster and even achieve better final performance. Sequential model-based optimization (SMBO) is the current state-of-the-art framework for automatic hyperparameter optimization. Currently, it consists of three components: a surrogate model, an acquisition function and an initialization technique. We propose to add a fourth component, a way of pruning the hyperparameter search space, which is a common way of accelerating the search in many domains but has not yet been applied to hyperparameter optimization. We propose to discard regions of the search space that are unlikely to contain better hyperparameter configurations by transferring knowledge from past experiments on other data sets, as well as taking into account the evaluations already done on the current data set. Pruning as a new component for SMBO is an orthogonal contribution, but nevertheless we compare it to surrogate models that learn across data sets and extensively investigate the impact of pruning with and without initialization for various state-of-the-art surrogate models. The experiments are conducted on two newly created meta-data sets which we make publicly available. One of these meta-data sets is created on 59 data sets using 19 different classifiers, resulting in a total of about 1.3 million experiments. This is more than four times larger than all the results collaboratively collected by OpenML.

Scientific Track
Authors:
Yuxiao Dong, University of Notre Dame
Fabio Pinelli, IBM Research - Ireland
Yiannis Gkoufas, IBM Research - Ireland
Zubair Nabi, IBM Research - Ireland
Francesco Calabrese, IBM Research - Ireland
Nitesh Chawla, University of Notre Dame

Abstract:
The pervasiveness and availability of mobile phone data offer the opportunity of discovering usable knowledge about crowd behavior in urban environments. Cities can leverage such knowledge to provide better services (e.g., public transport planning, optimized resource allocation) and a safer environment. Call Detail Record (CDR) data represents a practical data source to detect and monitor unusual events, considering the high level of mobile phone penetration compared with GPS-equipped and open devices. In this paper, we propose a methodology that is able to detect unusual events from CDR data, which typically has low accuracy in terms of space and time resolution. Moreover, we introduce a concept of unusual event that involves a large number of people who exhibit unusual mobility behavior. Our careful consideration of the issues that come from coarse-grained CDR data ultimately leads to a completely general framework that can detect unusual crowd events from CDR data effectively and efficiently. Through extensive experiments on real-world CDR data for a large city in Africa, we demonstrate that our method can detect unusual events with 16% higher recall and over 10 times higher precision, compared to state-of-the-art methods. We implement a visual analytics prototype system to help end users analyze detected unusual crowd events to best suit different application scenarios. To the best of our knowledge, this is the first work on the detection of unusual events from CDR data that accounts for its temporal and spatial sparseness and distinguishes between users' unusual activities and daily routines.

Scientific Track
Authors:

Shaona Ghosh, Adam Prugel-Bennett

Affiliation(s):
University of Southampton

Abstract:
We develop an online learning algorithm for bandits on a graph with side information where there is an underlying Ising distribution over the vertices at low temperatures. We are motivated by practical settings where the graph state in a social or computer host network (potentially) changes at every trial, intrinsically partitioning the graph and thus requiring the learning algorithm to play the bandit from the current partition. Our algorithm essentially functions as a two-stage process. In the first stage it uses "minimum cut" as the regularity measure to compute the state of the network by using the side label received, acting as a graph classifier. The classifier internally uses a polynomial-time linear programming relaxation technique that incorporates the known information to predict the unknown states. The second stage ensures that the bandits are sampled from the appropriate partition of the graph, with the potential for exploring the other part. We achieve this by running the adversarial multi-armed bandit for the edges in the current partition while exploring the "cut" edges. We empirically evaluate the strength of our approach through synthetic and real-world datasets. We also indicate the potential for a linear-time exact algorithm for calculating the max-flow as an alternative to the linear programming relaxation, besides promising bounded mistakes/regret in the number of times the "cut" changes.

Scientific Track
Authors:

Maria-Irina Nicolae, Eric Gaussier, Amaury Habrard, Marc Sebban

Affiliation(s):
Jean Monnet University
Universite Joseph Fourier

Abstract:
The importance of metrics in machine learning has attracted a growing interest in distance and similarity learning. We study this problem in the situation where few labeled data (and potentially few unlabeled data as well) are available, a situation that arises in several practical contexts. We also provide a complete theoretical analysis of the proposed approach. It is indeed worth noting that the metric learning research field lacks theoretical guarantees on the generalization capacity of the classifier associated with a learned metric. The theoretical framework of (ε,γ,τ)-good similarity functions has been one of the first attempts to draw a link between the properties of a similarity function and those of a linear classifier making use of it. In this paper, we extend this theory to a method where the metric and the separator are jointly learned in a semi-supervised way, a setting that has not been explored before, and provide a theoretical analysis of this joint learning via Rademacher complexity. Experiments performed on standard datasets show the benefits of our approach over state-of-the-art methods.

Scientific Track
Authors:
Ziqiang Shi, Fujitsu Research & Development
Rujie Liu, Fujitsu Research & Development

Abstract:
In this work, we generalize and unify two recent and quite different works, by Jascha and by Lee respectively, by proposing the proximal stochastic Newton-type gradient (PROXTONE) method for optimizing the sum of two convex functions: one is the average of a huge number of smooth convex functions, and the other is a nonsmooth convex function. PROXTONE incorporates second-order information to obtain stronger convergence results: it achieves a linear convergence rate not only in the value of the objective function, but also for the solution. The proofs are simple and intuitive, and the results and techniques can serve as a starting point for research on proximal stochastic methods that employ second-order information. The methods and principles proposed in this paper can be applied to logistic regression, the training of deep neural networks, and so on. Our numerical experiments show that PROXTONE achieves better computational performance than existing methods.

Scientific Track
Authors:

Duc Luu, Ee-Peng Lim

Affiliation(s):
Singapore Management University

Abstract:
Diffusion is an important dynamic that helps spread information within an online social network. While there are already numerous models for single-item diffusion, few have studied the diffusion of multiple items, especially when items can interact with one another due to their inter-similarity. Moreover, the well-known homophily effect is rarely considered explicitly in existing diffusion models. This work therefore fills this gap by proposing a novel model called Topic level Interaction Homophily Aware Diffusion (TIHAD) to include both latent factor level interaction among items and the homophily factor in diffusion. The model determines item interaction based on latent factors, and edge strengths based on the homophily factor, in the computation of social influence. An algorithm for training the TIHAD model is also proposed. Our experiments on synthetic and real datasets show that: (a) homophily increases diffusion significantly, and (b) item interaction at the topic level boosts diffusion among similar items. A case study on hashtag diffusion in Twitter also shows that TIHAD outperforms the baseline model in the hashtag adoption prediction task.

Scientific Track
Authors:
Pengtao Xie, Carnegie Mellon University

Abstract:
Learning a proper distance metric is of vital importance for many distance-based applications. Distance metric learning aims to learn a set of latent factors based on which the distances between data points can be effectively measured. The number of latent factors incurs a tradeoff: a small number of factors is not powerful and expressive enough to measure distances, while a large number of factors causes high computational overhead. In this paper, we aim to achieve two seemingly conflicting goals: keeping the number of latent factors small for the sake of computational efficiency, while making them as effective as a large set of factors. The approach we take is to impose a diversity regularizer over the latent factors to encourage them to be uncorrelated, such that each factor can capture some unique information that is hard to capture by other factors. In this way, a small number of latent factors can be sufficient to capture a large proportion of the information, which retains computational efficiency while preserving effectiveness in measuring distances. Experiments on retrieval, clustering and classification demonstrate that a small number of factors learned with diversity regularization can achieve comparable or even better performance than a large factor set learned without regularization.

Scientific Track
Authors:
Guillaume Cleuziou, LIFO Université d'Orléans
Gaël Dias, GREYC

Abstract:
In this paper, we propose a new methodology for the semi-supervised acquisition of lexical taxonomies. Our approach is based on the theory of pretopology, which offers a powerful formalism to model semantic relations and transform a list of terms into a structured term space by combining different discriminant criteria. In order to learn a parameterized pretopological space, we define the Learning Pretopological Spaces strategy based on genetic algorithms. In particular, rare but accurate pieces of knowledge are used to parameterize the different criteria defining the pretopological term space. Then, a structuring algorithm is used to transform the pretopological space into a lexical taxonomy. Results on three standard datasets show improved performance against state-of-the-art associative and pattern-based approaches.

Scientific Track
Authors:
Rohit Kumar, Université Libre de Bruxelles
Toon Calders, Université Libre de Bruxelles
Aristides Gionis, Aalto University
Nikolaj Tatti, Aalto University

Abstract:
Large networks are being generated by applications that keep track of relationships between different data entities. Examples include online social networks recording interactions between individuals, sensor networks logging information exchanges between sensors, and more. There is a large body of literature on computing exact or approximate properties of large networks, although most methods assume static networks. On the other hand, in most modern real-world applications, networks are highly dynamic, and interactions along existing connections are continuously generated. Furthermore, it is desirable to consider that old edges become less important, and their contribution to the current view of the network diminishes over time. We study the problem of maintaining the neighborhood profile of each node in an interaction network. Maintaining such a profile has applications in modeling network evolution and monitoring the importance of the nodes of the network over time. We present an online streaming algorithm to maintain neighborhood profiles in the sliding-window model. The algorithm is highly scalable, as it permits parallel processing and the computation is node-centric; hence it scales easily to very large networks on a distributed system. We present results from both serial and parallel implementations of the algorithm for different social networks.

Scientific Track
Authors:

Konstantinos Sechidis, Gavin Brown

Affiliation(s):
University of Manchester

Abstract:
The importance of Markov blanket discovery algorithms is twofold: as the main building block in constraint-based algorithms for learning Bayesian network structure, and as a technique to derive the optimal set of features in filter feature selection approaches. Equally, learning from partially labelled data is a crucial and demanding area of machine learning, and extending techniques from fully to partially supervised scenarios is a challenging problem. While there are many different algorithms to derive the Markov blanket of fully supervised nodes, the partially-labelled problem is far more challenging, and there is a lack of principled approaches in the literature. Our work derives a generalization of the conditional tests of independence for partially labelled binary target variables, which can handle the two main partially labelled scenarios: positive-unlabelled and semi-supervised. The result is a significantly deeper understanding of how to control false negative errors in Markov blanket discovery procedures and how unlabelled data can help.

Scientific Track
Authors:
Wojciech Czarnecki, Jagiellonian University
Rafal Józefowicz, Google
Jacek Tabor, Jagiellonian University

Abstract:
Representation learning is currently a very hot topic in modern machine learning, mostly due to the great success of deep learning methods. In particular, a low-dimensional representation which discriminates classes can not only enhance the classification procedure, but also make it faster, while, contrary to high-dimensional embeddings, it can be efficiently used for visual exploratory data analysis. In this paper we propose Maximum Entropy Linear Manifold (MELM), a multidimensional generalization of the Multithreshold Entropy Linear Classifier model, which is able to find a low-dimensional linear data projection maximizing the discriminativeness of the projected classes. As a result we obtain a linear embedding which can be used for classification, class-aware dimensionality reduction and data visualization. MELM provides highly discriminative 2D projections of the data which can be used as a method for constructing robust classifiers. We provide both an empirical evaluation as well as some interesting theoretical properties of our objective function, such as scale and affine transformation invariance, connections with PCA, and bounding of the expected balanced accuracy error.