Accepted Papers

Industrial Track
Authors:
Enda Barrett, Schneider Electric
Stephen Linder, Schneider Electric

Abstract:
Recent high-profile developments of autonomous learning thermostats by companies such as Nest Labs and Honeywell have brought to the fore the possibility of ever greater numbers of intelligent devices permeating our homes and working environments in the future. However, the specific learning approaches and methodologies utilised by these devices have never been made public; in fact, little is known about how these devices operate, or how they learn about their environments and users. This paper proposes a suitable learning architecture for such an intelligent thermostat in the hope that it will benefit further investigation by the research community. Our architecture comprises a number of different learning methods, each of which contributes to a complete autonomous thermostat capable of controlling an HVAC system. A novel state-action space formalism is proposed to enable a reinforcement learning agent to successfully control the HVAC system by optimising both occupant comfort and energy costs. Our results show that the learning thermostat can achieve cost savings of 10% over a programmable thermostat, whilst maintaining high occupant comfort standards.
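
The abstract does not disclose the architecture's internals, but the state-action formalism suggests a tabular reinforcement learning core. Below is a minimal Python sketch of such an agent, assuming a discretized state (temperature band, occupancy, time slot), a three-action HVAC interface, and illustrative reward weights; none of these details come from the paper.

    import random
    from collections import defaultdict

    ACTIONS = ["heat", "cool", "off"]
    ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
    Q = defaultdict(float)  # (state, action) -> estimated long-term value

    def reward(energy_cost, comfort_deviation, w_cost=1.0, w_comfort=2.0):
        # Trade off energy cost against deviation from the comfort setpoint.
        return -(w_cost * energy_cost + w_comfort * comfort_deviation)

    def choose_action(state):
        # Epsilon-greedy exploration over the discrete action set.
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    def update(state, action, r, next_state):
        # Standard Q-learning backup.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])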

Industrial Track
Authors:

George Forman, Hila Nachlieli, Renato Keshet

Affiliation(s):
Hewlett-Packard Labs

Abstract:
Our business users have often been frustrated by clustering results that do not suit their purpose: when trying to discover clusters of product complaints, the algorithm may return clusters of product models instead. The fundamental issue is that complex text data can be clustered in many different ways, and it is optimistic to expect relevant clusters from an unsupervised process, even with parameter tinkering. We studied this problem in an interactive context and developed an effective solution that recasts the problem formulation in a way radically different from traditional or semi-supervised clustering. Given training labels for some known classes, our method incrementally proposes complementary clusters. In tests on various business datasets, we consistently obtain relevant results at interactive time scales. This paper describes the method and demonstrates its superior ability using publicly available datasets. For automated evaluation, we devised a cluster evaluation framework that matches the business user's utility.
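
As a rough illustration of the recast formulation (not the authors' algorithm), one simple way to propose clusters that complement known classes is to down-weight the features most informative of those classes before clustering. The helper below is a hypothetical scikit-learn sketch; the feature-weighting scheme is an assumption.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_selection import mutual_info_classif

    def complementary_clusters(X, y_known, n_clusters=5):
        # Down-weight features that are highly informative of the known
        # classes, then cluster in the reweighted space so the proposed
        # clusters complement rather than repeat the known labels.
        mi = mutual_info_classif(X, y_known)
        weights = 1.0 / (1.0 + mi)
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X * weights)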

Industrial Track
Authors:
Romain Guigourès, Zalando
Dominique Gay, Orange
Marc Boullé, Orange Labs
Fabrice Clérot, Orange Labs
Fabrice Rossi, SAMM EA 4543

Abstract:
Call Detail Records (CDRs) are data recorded by telecommunications companies, consisting of basic information about several dimensions of the calls made through the network: the source, destination, date, and time of calls. CDR data analysis has received much attention in recent years, since it can reveal valuable information about human behavior, and it has shown high added value in many application domains, e.g., community analysis or network planning. In this paper, we suggest a generic methodology based on data grid models for summarizing the information contained in CDR data. The method is based on a parameter-free estimation of the joint distribution of the variables that describe the calls. We also suggest several well-founded criteria that allow one to browse the summary at various granularities and to explore it by means of insightful visualizations. The method handles network graph data, temporal sequence data, and user mobility data stemming from the original CDR data. We show the relevance of our methodology on real-world CDR data from Ivory Coast for various case studies, such as network planning strategy and yield-management pricing strategy.
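
The paper's data grid models select discretizations in a parameter-free way; the toy pandas sketch below only illustrates the kind of joint-distribution summary being browsed at different granularities, using fixed, hand-picked buckets. The column names and bucket boundaries are illustrative assumptions.

    import pandas as pd

    cdrs = pd.DataFrame({
        "source":      ["A", "A", "B", "B", "C"],
        "destination": ["B", "C", "A", "C", "A"],
        "hour":        [9, 14, 9, 21, 21],
    })
    # Joint counts over (source, destination, time-of-day band).
    cdrs["band"] = pd.cut(cdrs["hour"], bins=[0, 12, 18, 24],
                          labels=["morning", "afternoon", "evening"])
    grid = cdrs.groupby(["source", "destination", "band"], observed=True).size()
    # Coarser granularity: marginalize out the destination dimension.
    by_source_band = grid.groupby(level=["source", "band"]).sum()
    print(by_source_band)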

Industrial Track
Authors:
Hani Neuvirth, Microsoft
Yehuda Finkelstein, Microsoft
Amit Hilbuch, Microsoft
Shai Nahum, Microsoft
Daniel Alon, Microsoft
Elad Yom-Tov, Microsoft Research

Abstract:
Cloud computing resources are sometimes hijacked for fraudulent use. While some fraudulent use manifests as small-scale resource consumption, a more serious type of fraud is that of fraud storms, which are events of large-scale fraudulent use. These events begin when fraudulent users discover new vulnerabilities in the sign-up process, which they then exploit en masse. The ability to detect these storms early is a critical component of any cloud-based public computing system. In this work we analyze telemetry data from Microsoft Azure to detect fraud storms and raise early alerts on sudden increases in fraudulent use. Using machine learning to identify such anomalous events involves two inherent challenges: the scarcity of these events and, at the same time, the high frequency of anomalous events in cloud systems. We compare the performance of a supervised approach to that of an unsupervised, multivariate anomaly detection framework. We further evaluate the system's performance, taking into account practical considerations of robustness in the presence of missing values and minimization of the model's data collection period. This paper describes the system as well as the underlying machine learning algorithms applied. A beta version of the system is deployed and used to continuously control fraud levels in Azure.
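
The paper does not specify the internals of its unsupervised framework; as a hedged illustration of multivariate anomaly detection over telemetry, the sketch below scores each day against a trailing window using the Mahalanobis distance. The window length and regularization constant are assumptions.

    import numpy as np

    def storm_scores(X, window=30):
        # X: (days, features) telemetry counts. Each day is scored by its
        # Mahalanobis distance from the trailing window; a large score
        # suggests a sudden, correlated spike such as a fraud storm.
        scores = np.full(len(X), np.nan)
        for t in range(window, len(X)):
            hist = X[t - window:t]
            mu = hist.mean(axis=0)
            cov = np.cov(hist, rowvar=False) + 1e-6 * np.eye(X.shape[1])
            diff = X[t] - mu
            scores[t] = float(diff @ np.linalg.solve(cov, diff))
        return scores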

Industrial Track
Authors:
Thanh Lam Hoang, IBM
Eric Bouillet, IBM Research

Abstract:
Given a set of historical bus trajectories D and a partially observed bus trajectory S up to position l on the bus route, kernel regression (KR) is a non-parametric approach that predicts the arrival time of the bus at location l+h (h > 0) by averaging the arrival times observed at the same location in the past. The KR method does not weight the historical data equally but gives more preference to more similar trajectories. It has been shown to outperform baseline methods such as linear regression or k-nearest-neighbour algorithms for bus arrival time prediction. However, the performance of KR is very sensitive to how the similarity between trajectories is evaluated. The general kernel regression algorithm considers the entire trajectory when evaluating similarity. For bus arrival time prediction, this approach does not work well when outdated parts of the trajectories do not reflect the most recent behaviour of the buses. To solve this issue, we propose an approach that considers only the recent part of the trajectories, within a sliding window, when evaluating their similarity. The approach introduces a set of parameters corresponding to the window lengths at every position along the bus route, determining how far we should look back into the past when evaluating the similarity between trajectories. These parameters are learned automatically from training data. Nevertheless, parameter learning is time-consuming on large training data (at least quadratic in the training size). We therefore propose an approximation algorithm, with guarantees on error bounds, that learns the parameters efficiently and is an order of magnitude faster than the exact algorithm. In an experiment with a real-world application deployed for Dublin city, our approach significantly reduced the prediction error compared to the state-of-the-art kernel regression algorithm.
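
A minimal sketch of the windowed kernel-regression idea, assuming Gaussian weights, a fixed window length w, and a fixed bandwidth; in the paper, the window lengths are position-dependent and learned from training data.

    import numpy as np

    def predict_arrival(S, D, l, h, w, bandwidth=30.0):
        # S: arrival times of the running bus at positions 0..l.
        # D: (n_trips, n_positions) historical arrival times.
        # Similarity uses only the last w positions (the sliding window).
        lo = max(0, l + 1 - w)
        diffs = D[:, lo:l + 1] - S[lo:l + 1]
        dist2 = np.mean(diffs ** 2, axis=1)
        weights = np.exp(-dist2 / (2.0 * bandwidth ** 2))
        # Nadaraya-Watson estimate of the arrival time at position l + h.
        return float(np.sum(weights * D[:, l + h]) / np.sum(weights))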

Industrial Track
Authors:
Vojtech Franc, Czech Technical University in Prague
Michal Sofka, Cisco Systems
Karel Bartos, Cisco Systems

Abstract:
We address the problem of learning a detector of malicious behavior in network traffic. The malicious behavior is detected based on the analysis of network proxy logs that capture malware communication between client and server computers. The conceptual problem in using standard supervised learning methods is the lack of a sufficiently representative training set containing examples of malicious and legitimate communication. Annotation of individual proxy logs is an expensive process involving security experts and does not scale with constantly evolving malware. However, weak supervision can be achieved at the level of properly defined bags of proxy logs by leveraging internet domain blacklists, security reports, and sandboxing analysis. We demonstrate that an accurate detector can be obtained from the collected security intelligence data by using a Multiple Instance Learning algorithm tailored to the Neyman-Pearson problem. We provide a thorough experimental evaluation on a large corpus of network communications collected from various company network environments.
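
For illustration only: a naive bag-level pipeline in the spirit of the setup, where instance labels are inherited from weak bag labels, a bag is scored by its most suspicious instance, and the decision threshold is set on legitimate bags to respect a false-positive budget (the Neyman-Pearson flavour). The paper's actual MIL algorithm is more involved.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_bag_detector(bags, bag_labels, fp_budget=0.01):
        # Naive MIL baseline: propagate each bag's weak label to its
        # instances, then score a bag by its most suspicious instance.
        X = np.vstack(bags)
        y = np.concatenate([[lbl] * len(b) for b, lbl in zip(bags, bag_labels)])
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        bag_scores = np.array([clf.predict_proba(b)[:, 1].max() for b in bags])
        # Neyman-Pearson style threshold: keep the false-positive rate on
        # legitimate (label 0) bags within the given budget.
        neg = bag_scores[np.array(bag_labels) == 0]
        return clf, np.quantile(neg, 1.0 - fp_budget)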

Industrial Track
Authors:

Christian Bockermann, Jens Buss, Alexey Egorov, Kai Brügge, Katharina Morik

Affiliation(s):
TU Dortmund University

Abstract:
Experiments in high-energy astroparticle physics produce large amounts of data as continuous high-volume streams. Gaining insights from the observed data poses a number of challenges to data analysis at various steps in the analysis chain of the experiments. Machine learning methods have so far been adopted only selectively, at particular stages of the overall data-handling process. In this paper we investigate the deployment of machine learning methods at various stages of the data analysis chain in a gamma-ray astronomy experiment. Aiming at online and real-time performance, we build on prominent software libraries and discuss the complete cycle of data processing, from raw-data capture to high-level classification, using a data-flow based rapid-prototyping environment. In the context of a gamma-ray experiment, we review user requirements in this interdisciplinary setting and demonstrate the applicability of our approach in a real-world setting, providing results from high-volume data streams with real-time performance.
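
The paper builds on existing streaming libraries; the framework-agnostic Python sketch below merely illustrates the data-flow style of such an analysis chain, with placeholder calibration, feature-extraction, and classification stages.

    def calibrate(raw_events):
        for ev in raw_events:
            ev["pixels"] = [p * 0.98 for p in ev["pixels"]]  # toy gain correction
            yield ev

    def extract_features(events):
        for ev in events:
            yield {"size": sum(ev["pixels"]), "n_pixels": len(ev["pixels"])}

    def classify(features, threshold=5.0):
        for f in features:
            yield "gamma" if f["size"] > threshold else "hadron"

    # Stages compose into one lazy pipeline over the event stream.
    raw = ({"pixels": [1.0, 2.0, 3.0]} for _ in range(3))
    for label in classify(extract_features(calibrate(raw))):
        print(label)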

Industrial Track
Authors:
Karel Bartos, Cisco Systems
Michal Sofka, Cisco Systems

Abstract:
The goal of domain adaptation is to address differences between the joint distributions of observations and labels in the training and testing data sets. This problem arises in many practical situations, such as when a malware detector is trained from datasets labeled at a certain point in time, but the malware later evolves to evade detection. We solve the problem by introducing a new representation which ensures that the conditional distribution of the observations given the labels is the same. The representation is computed for bags of samples (network traffic logs) and is designed to be invariant under shifting and scaling of the feature values extracted from the logs and under permutation and size changes of the bags. The invariance of the representation is achieved by relying on a self-similarity matrix computed for each bag. Our experiments show that the representation is effective for training a detector of malicious traffic in large corporate networks. Compared to the case without domain adaptation, the recall of the detector improves from 0.81 to 0.88 and the precision from 0.998 to 0.999.
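
A hedged sketch of a bag representation with the stated invariances: standardizing features within a bag removes shift and scale, a pairwise self-similarity matrix removes dependence on absolute feature values, and a histogram of its off-diagonal entries removes dependence on bag size and ordering. The Gaussian kernel and bin count are assumptions, not the paper's construction.

    import numpy as np

    def bag_representation(bag, n_bins=16):
        # bag: (n_logs, n_features) feature vectors for one bag.
        # Standardizing within the bag removes shift and scale.
        Z = (bag - bag.mean(axis=0)) / (bag.std(axis=0) + 1e-9)
        d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        sim = np.exp(-d2)  # self-similarity matrix
        # A histogram of off-diagonal entries is invariant to permutation
        # and comparable across bags of different sizes.
        off_diag = sim[~np.eye(len(bag), dtype=bool)]
        hist, _ = np.histogram(off_diag, bins=n_bins, range=(0.0, 1.0))
        return hist / max(1, len(off_diag))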

Industrial Track
Authors:
Jens Schreiter, Robert Bosch GmbH
Duy Nguyen-Tuong, Robert Bosch GmbH
Mona Eberts, Robert Bosch GmbH
Bastian Bischoff, Robert Bosch GmbH
Heiner Markert, Robert Bosch GmbH
Marc Toussaint, University of Stuttgart

Abstract:
In this paper, the problem of safe exploration in the active learning context is considered. Safe exploration is especially important for data sampling from technical and industrial systems, e.g. combustion engines and gas turbines, where critical and unsafe measurements must be avoided. The objective is to learn data-based regression models of such technical systems using a limited budget of measured, i.e. labelled, points while ensuring that critical regions of the considered systems are avoided during measurement. We propose an approach for learning such models and exploring new data regions based on Gaussian processes (GPs). In particular, we employ a problem-specific GP classifier to identify safe and unsafe regions, while using a differential entropy criterion to explore relevant data regions. A theoretical analysis of the proposed algorithm is given, providing an upper bound on the probability of failure. To demonstrate the efficiency and robustness of our safe exploration scheme in the active learning setting, we test the approach on a policy exploration task for the inverted pendulum hold-up problem.
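
A minimal sketch of one safe-exploration loop consistent with the description: a GP classifier estimates the probability that a candidate input is safe, and the next measurement is the most uncertain candidate among those deemed confidently safe. The safety margin and the binary-entropy stand-in for the paper's differential entropy criterion are assumptions.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessClassifier

    def next_measurement(X_seen, safe_labels, candidates, p_safe=0.95):
        # safe_labels: 1 for safe past measurements, 0 for unsafe ones.
        gpc = GaussianProcessClassifier().fit(X_seen, safe_labels)
        p = gpc.predict_proba(candidates)[:, 1]  # estimated P(safe)
        entropy = -(p * np.log(p + 1e-12) + (1 - p) * np.log(1 - p + 1e-12))
        allowed = p >= p_safe
        if not allowed.any():
            return None  # nothing is confidently safe; stop exploring
        # Most informative candidate among those deemed safe enough.
        return candidates[np.argmax(np.where(allowed, entropy, -np.inf))]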

Industrial Track
Authors:
Helena Aidos, Instituto de Telecomunicações
André Lourenço, Instituto de Telecomunicações
Diana Batista, Instituto de Telecomunicações
Samuel Rota Bulò, Fondazione Bruno Kessler
Ana Fred, Instituto de Telecomunicações

Abstract:
Pervasive technology is changing the paradigm of healthcare by empowering users and families with the means for self-care and general health management. However, this requires accurate algorithms for information processing and pathology detection. Accordingly, this paper presents a system for electrocardiography (ECG) pathology classification, relying on a novel semi-supervised consensus clustering algorithm that finds a consensus partition among a set of baseline clusterings collected for the data under consideration. In contrast to typical unsupervised scenarios, our solution exploits partial prior knowledge about a subset of the data points. Our method is built upon the evidence accumulation framework to efficaciously sidestep the cluster correspondence problem. Computationally, the consensus partition is sought by exploiting a result known as the Baum-Eagon inequality in the probability domain, which allows for a step-size-free optimization. Experiments on standard benchmark datasets show the validity of our method against the state of the art. On the real-world problem of ECG pathology classification, the proposed method achieves performance comparable to supervised learning methods using as few as 20% labeled data points.
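
The evidence-accumulation ingredient can be illustrated as follows: baseline partitions vote in a co-association matrix whose (i, j) entry is the fraction of partitions placing points i and j together. The paper optimizes the consensus partition via the Baum-Eagon inequality; this sketch substitutes off-the-shelf spectral clustering on the matrix, so it is an analogy rather than the authors' optimizer.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def consensus(partitions, n_clusters):
        # partitions: list of 1-D label arrays, one per baseline clustering.
        n = len(partitions[0])
        C = np.zeros((n, n))
        for labels in partitions:
            labels = np.asarray(labels)
            C += labels[:, None] == labels[None, :]
        C /= len(partitions)  # co-association: fraction of co-assignments
        return SpectralClustering(n_clusters=n_clusters,
                                  affinity="precomputed").fit_predict(C)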

Industrial Track
Authors:
Phiradet Bangcharoensap, Tokyo Institute of Technology
Hayato Kobayashi, Yahoo Japan Corporation
Nobuyuki Shimizu, Yahoo Japan Corporation
Satoshi Yamauchi, Yahoo Japan Corporation
Tsuyoshi Murata, Tokyo Institute of Technology

Abstract:
We analyze a social graph of online auction users and propose an approach for online auction fraud detection. In this paper, fraudsters are those who participate in their own auctions in order to drive up the final price. They tend to bid frequently in auctions hosted by fraudulent sellers who work in the same collusion group. Our graph-based semi-supervised learning approach to online auction fraud detection is based on this social interaction among fraudsters. Auction users and their transactions are represented as a social interaction graph. Given a small set of known fraudsters, our aim is to detect more fraudsters based on the hypothesis that strong edges between fraudsters frequently exist in online auction social graphs; detecting fraudsters who work in collusion with known fraudsters is our primary goal. We also found that weighted degree centrality is a distinctive feature separating fraudsters from legitimate users, and we actively exploit this fact for detection. To this end, we extend the modified adsorption model by incorporating the weighted degree centrality of nodes. Results on real-world data show that integrating the weighted degree centrality into the model significantly improves accuracy.
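
A simplified stand-in for the extended modified-adsorption model: fraud scores propagate from seed fraudsters along weighted edges, with weighted degree centrality mixed in as a prior. The mixing coefficients and the clamping scheme are illustrative assumptions.

    import numpy as np

    def propagate(W, seeds, alpha=0.85, beta=0.1, iters=50):
        # W: (n, n) symmetric edge-weight matrix of the interaction graph.
        # seeds: {node_index: 1.0} for known fraudsters.
        degree = W.sum(axis=1)
        prior = degree / degree.max()  # weighted degree centrality
        P = W / np.maximum(degree[:, None], 1e-12)  # row-normalized
        f = np.zeros(len(W))
        for node, val in seeds.items():
            f[node] = val
        for _ in range(iters):
            f = alpha * (P @ f) + beta * prior
            for node, val in seeds.items():  # clamp the known fraudsters
                f[node] = val
        return f  # higher score = more suspicious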

Industrial Track
Authors:

Michal Aharon, Eshcar Hillel, Amit Kagian, Ronny Lempel, Hayim Makabee, Raz Nissim

Affiliation(s):
Yahoo Labs

Abstract:
As consumers of television are presented with a plethora of available programming, improving recommender systems in this domain is becoming increasingly important. Television sets, though, are often shared by multiple users whose tastes may vary greatly. Recommendation systems are challenged by this setting, since viewing data is typically collected and modeled per device, aggregating over its users and obscuring their individual tastes. This paper tackles the challenge of TV recommendation, specifically aiming to recommend the next program to watch following the currently watched program on the device. We present an empirical evaluation of several recommendation methods over large-scale, real-life TV viewership data. Our extensions of common state-of-the-art recommendation methods, which exploit the current watching context, demonstrate a significant improvement in recommendation quality.
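
As a toy illustration of exploiting the watching context (not the paper's methods): score candidate programs by how often they historically follow the currently watched program in the viewing logs.

    from collections import Counter, defaultdict

    def build_transitions(viewing_sessions):
        # viewing_sessions: per-device lists of program ids in watch order.
        follows = defaultdict(Counter)
        for session in viewing_sessions:
            for current, nxt in zip(session, session[1:]):
                follows[current][nxt] += 1
        return follows

    def recommend(follows, current_program, k=3):
        return [prog for prog, _ in follows[current_program].most_common(k)]

    sessions = [["news", "sports", "movie"], ["news", "sports", "series"]]
    print(recommend(build_transitions(sessions), "news"))  # ['sports']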