# General issues

This page describes the general issues related to the application of data-mining techniques, other than preprocessing. More information (or references for finding information) regarding individual algorithms is found elsewhere in the wiki.

In general, data mining techniques can be categorized in multiple ways. *Predictive*, or *supervised*, techniques build models that can predict values of a *class* label. Existing class labels determined through some external means serve as the *ground truth* for building these models. Typical examples of techniques in this category are Decision Trees and Support Vector Machines. *Descriptive*, or *unsupervised*, techniques try to detect the inherent structure of the datasets, for example, groups of similar items that are distinct from other groups. Approaches such as k-means clustering or the Expectation Maximization algorithm fall into this category. In addition, there are approaches for the detection of patterns, association rules, and outliers. For an introduction to data mining, there are a number of excellent texts available, for example:

**TBD**

An exploratory investigation of the exoplanet (and any other) data requires the application of a large variety of techniques, such as those listed below. In addition, and in view of the large size of astronomical datasets available even today, scalable approaches will be investigated as well.
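The difference between predictive and descriptive techniques described above can be illustrated with a small sketch. It is not one of the algorithms selected for this project: the nearest-centroid classifier, the one-dimensional toy values, and all function names are assumptions chosen only to keep the example short.

```python
from statistics import mean

def fit_centroids(values, labels):
    """Predictive step: learn one centroid per class from labelled values."""
    return {
        c: mean(v for v, l in zip(values, labels) if l == c)
        for c in set(labels)
    }

def predict(centroids, value):
    """Predict the class whose centroid is closest to the value."""
    return min(centroids, key=lambda c: abs(centroids[c] - value))

def kmeans_1d(values, k, iterations=20):
    """Descriptive step: group unlabelled values with Lloyd's k-means."""
    ordered = sorted(values)
    # Deterministic start: pick up to k evenly spaced values as initial centers.
    centers = ordered[:: max(1, len(ordered) // k)][:k]
    assignment = []
    for _ in range(iterations):
        # Assign every value to its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values
        ]
        # Move each center to the mean of its members (empty clusters stay put).
        for j in range(len(centers)):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:
                centers[j] = mean(members)
    return centers, assignment

# Toy data (made up for illustration only).
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["a", "a", "a", "b", "b", "b"]   # ground truth, used only by the classifier

centroids = fit_centroids(values, labels)  # supervised: needs labels
centers, groups = kmeans_1d(values, k=2)   # unsupervised: labels never seen
```

The classifier can only be built because `labels` exists, whereas `kmeans_1d` recovers the two groups from the values alone.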

# Connection to other teams

The application of data-mining techniques to any data requires close collaboration with the domain specialists, as well as the visualization group (for visualization of the data, preprocessing results before application of the techniques, as well as the model results) and the workflow team, especially in view of the fact that other datasets with different characteristics and much larger sizes will be investigated in the future.

# Data mining techniques under consideration

## Summarization Techniques

## Predictive techniques

## Descriptive approaches / clustering (Sabine)

## Outlier detection

## Pattern detection, including functional dependencies (Pascal)

## Techniques for streaming data

## Incremental approaches

## Approximation techniques

## Distributed approaches

## Parallel techniques (Sabine)

## Active learning / Semi-supervised techniques

Active learning [1] is a form of supervised machine learning in which a learning algorithm is able to interactively query the user (or some other information source) to obtain the desired outputs at new data points. In the statistics literature it is sometimes also called optimal experimental design.

There are situations in which unlabelled data is abundant but manually labelling it is expensive. In such a scenario, learning algorithms can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples needed to learn a concept can often be much lower than the number required in normal supervised learning. With this approach, however, there is a risk that the algorithm is overwhelmed by uninformative examples.

In this project we will use two different active learning approaches:

Guided search (model disambiguation):

- Since we do not have enough information to decide which model is better, predict the rest of the rows with all the models.
- Compute the minimum set of observations that will allow deciding which model is better (set with maximum divergence of predictions between models).
- Inform astronomers of them so that we can have more informative results quickly.
- Once new data is available, retrain all models and check if there is some improvement.
- Iterate until a model reaches a desired classification accuracy or no progress is shown in any model.
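The selection step of guided search could be sketched roughly as follows. This is a simplified heuristic in the spirit of query-by-committee, not the project's actual implementation: it ranks unlabelled rows by how strongly the models disagree and returns the top few, rather than computing a provably minimal set. The function names, the `budget` parameter, and the toy threshold models are assumptions made for illustration.

```python
from collections import Counter

def disagreement(predictions):
    """Fraction of models that deviate from the majority prediction."""
    majority_votes = Counter(predictions).most_common(1)[0][1]
    return 1.0 - majority_votes / len(predictions)

def select_most_divergent(models, pool, budget):
    """Return indices of up to `budget` unlabelled rows on which models disagree most."""
    scores = []
    for index, row in enumerate(pool):
        predictions = [model(row) for model in models]
        scores.append((disagreement(predictions), index))
    # Highest disagreement first; ties are broken by row index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    # Rows on which all models agree cannot help to tell the models apart.
    return [index for score, index in scores[:budget] if score > 0]

# Hypothetical candidate models: simple threshold classifiers on one feature.
models = [lambda x: x > 1.0, lambda x: x > 2.0, lambda x: x > 3.0]
pool = [0.5, 1.5, 2.5, 3.5]          # unlabelled rows (made-up values)

to_label = select_most_divergent(models, pool, budget=2)
```

The rows in `to_label` would be the ones passed to the astronomers; after they are labelled, all models are retrained and the selection is repeated.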

- This is the dual strategy to Guided Search.
- Compute the set of predictions on which most models agree. The idea is that an error found in one of these predictions will have an impact on most of the classifiers. Positive predictions with consensus might already be of interest!
- Inform astronomers of them so that we can have the real values of the predictions.
- Once new data is available, retrain all models and check if there is some improvement.
- Iterate until a model reaches a desired classification accuracy or no progress is shown in any model.
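A minimal sketch of the consensus selection step is shown below, assuming that each model is a callable returning a class label. The `min_agreement` threshold, the function names, and the toy threshold models are assumptions introduced for the example; they are not taken from the project.

```python
from collections import Counter

def consensus(predictions):
    """Return the majority label and the fraction of models that voted for it."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes / len(predictions)

def select_consensus(models, pool, min_agreement=0.8):
    """Return (row index, agreed label, agreement) for rows with broad agreement."""
    selected = []
    for index, row in enumerate(pool):
        label, agreement = consensus([model(row) for model in models])
        if agreement >= min_agreement:
            selected.append((index, label, agreement))
    return selected

# Hypothetical candidate models: simple threshold classifiers on one feature.
models = [lambda x: x > 1.0, lambda x: x > 2.0, lambda x: x > 3.0]
pool = [0.5, 1.5, 2.5, 3.5]          # unlabelled rows (made-up values)

# Rows where all three models agree; these would be sent for verification.
to_verify = select_consensus(models, pool, min_agreement=1.0)
positives = [index for index, label, _ in to_verify if label]
```

Here `positives` holds the rows that every model predicts as positive, which are the candidates that might already be of interest before any new labels arrive.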

References:

[1] B. Settles, Active Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan and Claypool Publishers, June 2012, 114 pages.

# Cost model and Evaluation of Results

One particular challenge to be addressed is the need for a cost model that takes into account the (known) algorithmic complexities of all techniques, the characteristics of the data, and the computational environment. While this is less relevant for small datasets and/or less complex techniques, it is required for the analysis of Big Data, and even for the application of complex techniques to medium-sized datasets, given the interactive nature of the data-mining process. Part of this process is the evaluation of the computational environment, which can range from machines with a small number of cores to large distributed environments, possibly supported by accelerator architectures. Furthermore, the software environment will have to be taken into account (Weka, R, code written in a compiled language such as C, etc.).
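To make the idea concrete, a first-order cost model could combine an operation count derived from the algorithmic complexity with the data size and the number of cores. The sketch below is purely illustrative: the complexity formulas are rough, commonly quoted orders of magnitude (for kernel SVMs the lower end of the usual range), the constants and the `ops_per_second` figure are placeholders, and Amdahl's law is used as a simple stand-in for the effect of the computational environment.

```python
import math

# Rough operation counts as a function of rows n and features d.
# These are assumptions for illustration, not measured values.
COMPLEXITY = {
    "kmeans": lambda n, d, k=10, iterations=20: n * d * k * iterations,
    "decision_tree": lambda n, d: n * d * math.log2(max(n, 2)),
    "svm_kernel": lambda n, d: n ** 2 * d,
}

def estimated_seconds(algorithm, n_rows, n_features, ops_per_second,
                      cores=1, parallel_fraction=0.0):
    """Estimate the runtime from complexity, data size, and environment."""
    operations = COMPLEXITY[algorithm](n_rows, n_features)
    sequential_time = operations / ops_per_second
    # Amdahl's law: only the parallel fraction benefits from extra cores.
    speedup = 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)
    return sequential_time / speedup

# Hypothetical environment: 1e7 basic operations per second per core.
small = estimated_seconds("svm_kernel", 1_000, 10, ops_per_second=1e7)
large = estimated_seconds("svm_kernel", 1_000_000, 10, ops_per_second=1e7)
```

Even this crude model shows why the question matters for interactive use: increasing the number of rows by a factor of 1,000 increases the estimate for the quadratic technique by a factor of 1,000,000.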

In addition, we will have to determine which measures to use for the evaluation of the results of individual models, and for the comparison of models. Existing measures such as classification accuracy, precision, recall, ROC curves, etc. are sufficient for small data, but may require some addition or modification for use with Big Data. One particularly interesting aspect of this project is also the visualization of the model results, which can guide the evaluation process as well.
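For reference, the basic measures mentioned above can be computed from the counts of a binary confusion matrix. The following is a minimal sketch; the function names and the toy label vectors are assumptions made for the example.

```python
def confusion_counts(actual, predicted, positive=True):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def evaluate(actual, predicted, positive=True):
    """Return accuracy, precision, and recall (0.0 where undefined)."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels: 3 positives and 5 negatives.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]

scores = evaluate(actual, predicted)
```

With strongly imbalanced classes, accuracy alone can look good while recall is poor, which is one reason to report several measures side by side.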

# Data mining goals for next 12 months

**Should we mention some possible publications here? And in the preprocessing section? And maybe where some Master-level or PhD level work could be?**

- Application of a wide range of techniques, as listed above, to the dataset currently available and to a larger version expected in early January 2013
- Identification of most promising techniques for the available datasets (with domain specialists)
- First step in the generalization of the application of techniques to other datasets (including the climate data), taking into account a preliminary cost model relating algorithms, data characteristics, and typical model evaluation measures such as classification accuracy and recall (with workflow team)
- Identification of algorithms that exceed the computational resources provided by a sequential computational environment
- Thorough review of existing scalable data-mining approaches, including those listed above, and identification of their suitability for our algorithms
- Metadata for models, using modelling language recommended by workflow team
- Identification and evaluation of possible environments suitable for the application of the data-mining techniques, such as WEKA, R, etc. (with workflow and visualization teams)
- Provision of a tutorial (wiki pages plus hands-on seminar) to other team members on introductory data mining using WEKA
- Provision of a tutorial (wiki pages plus hands-on seminar) to other team members on outlier detection using data-mining techniques

# Data-Mining goals for next five years (cross fertilization with other teams to be determined during 2013)

- Generalization of the application of techniques to other datasets (including the climate data), taking into account a cost model relating algorithms, data characteristics, computational environment, and typical model evaluation measures such as classification accuracy and recall (with workflow team)
- Implementation of a Graphical User Interface
- Implementation of scalable algorithms using suitable environments
- Application of algorithms to a wide range of datasets from multiple domains
- Additional tutorials at a rate of at least one per year, offered through this group, more as required or suitable