General issues

Astrophysical data displays various characteristics interesting from a data-mining viewpoint, including missing data, errors associated with the measurements, multiple measurements for the same object over time, heterogeneous data, bias in the sample selection, and imbalanced data.
For the particular dataset under investigation, the most relevant issues to be addressed within the next year are the missing values and the extreme imbalance in the data.

Connection to other teams

We assume that the visualization team will provide us with a visualization of the dataset as proposed at the November meeting, and that the workflow group will further investigate possibilities for a modelling language that addresses the requirements identified at the same meeting (taking into account the characteristics of the data and of the computing environment as well). We also expect other team members to provide us with a thorough understanding of the data and how it was obtained.

Current preprocessing requirements

Below are the preprocessing requirements that, based on current knowledge about the data, will be relevant in the short term. The selection of specific preprocessing methods to address these issues will be guided by a thorough statistical analysis.

  • Missing values
    There is a large fraction of missing values in the dataset. In general, there are different mechanisms by which missing data can arise. Different labels for different types are in use, but one common set is that proposed by (needs reference). One way to classify missing data is as follows:
    Missing Completely at Random: the fact that an attribute's value is missing is not related to the value of any attribute
    Missing at Random: the fact that a value is missing is related to the observed values of other attributes, but not to the missing value itself
    Censored: the fact that a value is missing is directly related to the attribute value itself (very high or very low values are missing, for example)
    1. D. Rubin. Inference and missing data. Biometrika (1976) 63 (3): 581-592.
    2. R. Little and D. Rubin. Statistical analysis with missing data. New York, Wiley, 1987
    3. J. Schafer. Analysis of incomplete multivariate data. London, Chapman & Hall, 1997
    4. P. Allison, Missing Data. Thousand Oaks: Sage, 2001
    5. J. Schafer and J. Graham. Missing data: our view of the state of the art. Psychological Methods, 2002
  • Imbalanced data and sampling techniques
  • Noise
  • Correction for distance-dependency of the magnitudes for the larger dataset
  • Correlation detection, attribute selection, and feature extraction
  • Summarization Techniques
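
The three missingness mechanisms listed under "Missing values" can be illustrated with a small simulation. This is only a sketch on synthetic data, not on the project dataset: the attributes x and y, the missingness rates, the censoring threshold, and all function names are hypothetical choices made for illustration. It compares the mean of the observed y values under each mechanism with the mean of the full data.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Synthetic data: x is always observed, y is correlated with x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)
        rows.append((x, y))
    return rows, rng


def apply_mcar(rows, rng, rate=0.3):
    # Missing Completely at Random: independent of every attribute value.
    return [(x, None if rng.random() < rate else y) for x, y in rows]


def apply_mar(rows, rng, rate=0.6):
    # Missing at Random: depends only on the observed attribute x.
    return [(x, None if (x > 0 and rng.random() < rate) else y) for x, y in rows]


def apply_censored(rows, threshold=1.0):
    # Censored: depends on y itself, here all values above the threshold are lost.
    return [(x, None if y > threshold else y) for x, y in rows]


def observed_mean(rows):
    # Complete-case mean of y, ignoring the missing entries.
    return statistics.fmean(y for _, y in rows if y is not None)


if __name__ == "__main__":
    rows, rng = simulate()
    print("full data:", statistics.fmean(y for _, y in rows))
    print("MCAR     :", observed_mean(apply_mcar(rows, rng)))
    print("MAR      :", observed_mean(apply_mar(rows, rng)))
    print("censored :", observed_mean(apply_censored(rows)))
```

In this setup the complete-case mean stays close to the full-data mean only under MCAR. Under MAR the bias can in principle be corrected by conditioning on x, whereas under censoring the observed data alone do not contain the information needed for a correction.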

Multiple imputation

Managing complex databases with outliers, missing values and many other problems is common practice in any organization or company. Indeed, data preprocessing and standardization is one of the most challenging tasks of any data mining process. As part of data preprocessing, multiple imputation [1,2,3] is a family of methods for filling in incomplete data in complex datasets.

R [4] is now a widely used software environment for statistical analysis. In recent years, a multiple imputation library based on chained equations [5] has been developed for it [6], and the current tendency is for researchers and data practitioners to use this kind of semiautomatic library for data preprocessing.

Because the astrophysics dataset in this project has many missing values, multiple imputation techniques will be required.
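
To make the idea of multiple imputation concrete, the following is a minimal sketch, not the chained-equations algorithm of the R library [6]. It assumes a single incomplete attribute y that is imputed from one fully observed attribute x by linear regression. A bootstrap of the complete cases stands in for drawing the regression parameters, and the m completed datasets are pooled with Rubin's rules (total variance = within-imputation variance + (1 + 1/m) × between-imputation variance). The synthetic data and all function names are hypothetical.

```python
import random
import statistics


def make_mar_data(n=5000, seed=1):
    """Synthetic data in which y goes missing depending on the observed x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)
        if x > 0 and rng.random() < 0.6:
            y = None
        rows.append((x, y))
    return rows


def fit_line(pairs):
    # Ordinary least squares for y = a + b*x; returns (a, b, residual sd).
    n = len(pairs)
    mx = statistics.fmean(x for x, _ in pairs)
    my = statistics.fmean(y for _, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    b = sxy / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in pairs)
    return a, b, (rss / (n - 2)) ** 0.5


def impute_once(rows, rng):
    # Refit on a bootstrap sample so each imputation reflects parameter uncertainty.
    complete = [(x, y) for x, y in rows if y is not None]
    boot = [rng.choice(complete) for _ in complete]
    a, b, sd = fit_line(boot)
    # Fill each missing y with the regression prediction plus residual noise.
    return [(x, y if y is not None else a + b * x + rng.gauss(0.0, sd))
            for x, y in rows]


def pool_mean(rows, m=20, seed=0):
    """Estimate the mean of y from m imputed datasets, pooled with Rubin's rules."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        ys = [y for _, y in impute_once(rows, rng)]
        estimates.append(statistics.fmean(ys))
        variances.append(statistics.variance(ys) / len(ys))
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance
    total = within + (1 + 1 / m) * between     # Rubin's total variance
    return q_bar, total
```

Calling pool_mean(make_mar_data()) returns the pooled estimate of the mean of y together with its total variance. For a dataset with several incomplete attributes, a chained-equations library such as the one cited in [6] would be the more realistic choice.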

  1. D. Rubin. Inference and missing data. Biometrika, 63:581–592, 1976.
  2. D. Rubin. Multiple imputation for nonresponse in surveys. Wiley, New York, 1987.
  3. D. Rubin. Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434):473–489, 1996.
  4. The R Project for Statistical Computing.
  5. T. Raghunathan, J. Lepkowski, J. Van Hoewyk, and P. Solenberger. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology, 27:85–95, 2001.
  6. S. van Buuren and K. Groothuis-Oudshoorn. MICE: Multivariate imputation by chained equations in R. Journal of Statistical Software, in press:1–68, 2010.

Preprocessing goals for next 12 months

  • Identification of preprocessing requirements through extensive application of techniques to the dataset currently available and to a larger version expected in early January 2013
  • Compilation of thorough description of bias in the data which may be relevant to the application of the techniques (with Christian Surace)
  • Thorough statistical evaluation of data (Jordi, visualization team)
  • Identification of most promising avenues for preprocessing techniques (including summarization of the data) and identification of challenges such as computational complexity, data access bottlenecks, etc (with workflow team)
  • Identification and Evaluation of possible environments suitable for the application of the pre-processing techniques such as WEKA, R, etc (with workflow and visualization teams)
  • Identification and Evaluation of Summarization Techniques
  • Provision of metadata
  • Provision of a tutorial (wiki pages plus hands-on seminar) to other team members explaining any relevant preprocessing techniques, candidate environments, challenges, etc. from a data-mining viewpoint. This tutorial is to be kept distinct from the typical preprocessing techniques applied to raw astronomical data.

Preprocessing goals for next five years
(cross fertilization with other teams to be determined during 2013)

  • Compilation of larger dataset from Corot data
  • Identification of alternative astronomical data, compilation of such data
  • Identification of preprocessing requirements for other astrophysical data
  • Identification of impact of bias on preprocessing requirements for all datasets
  • Extension of Metadata to all datasets
  • Implementation of preprocessing techniques in chosen sequential environment
  • Evaluation of scalability of preprocessing for Big Data
  • Parallel implementations of required preprocessing techniques in a number of parallel environments, using computational resources in France (which lab?) and Canada (Sharcnet, needs link)
  • Evaluation of most suitable processing environment, taking into account the cost factor described elsewhere in this Wiki