Various data mining algorithms are being applied by astronomers in numerous applications in astronomy. However, long-term studies and several mining projects have also been undertaken by experts in the field of data mining using astronomical data, because astronomy has produced many large datasets that are amenable to the approach, along with various other fields such as medicine and high-energy physics. Examples of such projects are SKICAT, the Sky Image Cataloging and Analysis System, for catalog production and analysis of catalogs from digitized sky surveys, in particular the scans of the second Palomar Observatory Sky Survey; JARTool, the Jet Propulsion Laboratory Adaptive Recognition Tool, used for the recognition of volcanoes in over 30,000 images of Venus returned by the Magellan mission; and the subsequent and more general Diamond Eye and the Lawrence Livermore National Laboratory Sapphire project.

Object classification

Classification is a crucial preliminary step in the scientific process, as it provides a way of arranging information that can be used to make hypotheses and to compare easily with models. The two most useful concepts in object classification are completeness and efficiency, also known as recall and precision.
They are generally defined in terms of true and false positives (TP and FP) and true and false negatives (TN and FN). The completeness is the fraction of objects that are in reality of a given type that are classified as that type:

$$\mathrm{completeness} = \frac{TP}{TP + FN},$$

and the efficiency is the fraction of objects classified as a given type that are truly of that type:

$$\mathrm{efficiency} = \frac{TP}{TP + FP}.$$

These two quantities are astrophysically interesting because, while one wants both high completeness and high efficiency, there is generally a tradeoff involved. The importance of each often depends on the application: for instance, an investigation of rare objects generally requires high completeness while allowing some contamination (lower efficiency), but statistical clustering of cosmological objects requires high efficiency, even at the expense of completeness.
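These definitions translate directly into code. Below is a minimal sketch assuming binary labels held in NumPy arrays; the array names and the 1 = galaxy convention are illustrative, not from the original text:

```python
import numpy as np

def completeness_efficiency(y_true, y_pred, positive_class=1):
    """Return (completeness, efficiency) for the given positive class."""
    tp = np.sum((y_true == positive_class) & (y_pred == positive_class))
    fn = np.sum((y_true == positive_class) & (y_pred != positive_class))
    fp = np.sum((y_true != positive_class) & (y_pred == positive_class))
    completeness = tp / (tp + fn)  # recall: fraction of true members recovered
    efficiency = tp / (tp + fp)    # precision: fraction of claimed members that are real
    return completeness, efficiency

# Toy example: 1 = galaxy, 0 = star (hypothetical labels)
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])
c, e = completeness_efficiency(y_true, y_pred)
print(f"completeness={c:.2f}, efficiency={e:.2f}")  # 0.80, 0.80
```

Star-Galaxy Separation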
Due to their small physical size compared to their distance from us, almost all stars are unresolved in photometric datasets and therefore appear as point sources. Galaxies, despite being further away, generally subtend a larger angle and appear as extended sources. However, other astrophysical objects, such as quasars and supernovae, also appear as point sources. Thus, the separation of photometric catalogs into stars and galaxies, or more generally stars, galaxies, and other objects, is an important problem. The number of galaxies and stars in typical surveys (of order $10^8$ or more) requires that such separation be automated. This problem is well studied, and automated approaches were employed even before current data mining algorithms became popular, for instance during the digitization of photographic plates by scanning machines such as the APM and DPOSS. Several data mining algorithms have been applied, including ANN, DT, mixture modeling, and SOM, with most achieving around 95% efficiency or better. Typically, this is performed using a set of measured morphological parameters derived from the survey photometry, perhaps supplemented by colors or other information such as the seeing. The advantage of the data mining approach is that all such information about each object can be easily incorporated.
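As a concrete illustration, here is a hedged sketch of an automated separator using a decision tree, one of the algorithms cited above; the feature matrix and labels are synthetic stand-ins for real measured morphological parameters, not data from any particular survey:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
# Synthetic stand-in for survey photometry: rows are objects, columns are
# measured morphological parameters (plus, e.g., a color or the seeing).
X = rng.normal(size=(n, 4))
# Hypothetical ground truth: 1 = galaxy (extended), 0 = star (point source).
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=6).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

In practice the tree would be trained on a spectroscopically confirmed subsample and then applied to the full photometric catalog.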
Galaxy Morphology

Galaxies come in a wide range of sizes and shapes or, more collectively, morphology. The most well-known system for the morphological classification of galaxies is the Hubble sequence of elliptical, spiral, barred spiral, and irregular, along with various subclasses. This system correlates with many physical properties known to be important in the formation and evolution of galaxies. Because galaxy morphology is a complex phenomenon that correlates with the underlying physics but is not unique to any one given process, the Hubble sequence has endured, despite being rather subjective and based on visible-light morphology originally derived from blue-biased photographic plates. The Hubble sequence has been extended in various ways, and for data mining purposes the T system has been used extensively. This system maps the categorical Hubble types E, S0, Sa, Sb, Sc, Sd, and Irr onto the numerical values -5 to 10. One can train a supervised algorithm to assign T types to images for which measured parameters are available. Such parameters can be purely morphological, or can include other information such as color.
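A minimal sketch of this supervised assignment of T types follows, with a small scikit-learn ANN regressor standing in for the methods discussed below; the parameters, their relation to T type, and the data are synthetic assumptions for illustration only:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 5))  # measured morphological parameters and/or colors
# Hypothetical T types on the -5 (elliptical) to 10 (irregular) scale.
t_type = np.clip(3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=1.0, size=n), -5, 10)

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=1)
model.fit(X[:1500], t_type[:1500])          # train on a labeled subsample
pred = model.predict(X[1500:])              # assign T types to the rest
print("RMS error in T type:", np.sqrt(np.mean((pred - t_type[1500:]) ** 2)))
```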
A series of papers by Lahav and collaborators does exactly this, applying ANNs to predict the T type of galaxies at low redshift and finding accuracy comparable to that of human experts. ANNs have also been applied to higher-redshift data to distinguish between normal and peculiar galaxies, and the fundamentally topological and unsupervised SOM ANN has been used to classify galaxies from Hubble Space Telescope images, where the initial distribution of classes is unknown. Likewise, ANNs have been used to obtain morphological types from galaxy spectra.

Photometric redshifts

An area of astrophysics that has greatly increased in popularity in the last few years is the estimation of redshifts from photometric data (photo-zs). This is because, although the distances are less accurate than those obtained with spectra, the sheer number of objects with photometric measurements can often make up for the reduction in individual accuracy by suppressing the statistical noise of an ensemble calculation.
The two common approaches to photo-zs are the templatemethod and the empirical training set method. The template approach has many difficult issues, including calibration, zero-points, priors, multiwavelength performance (e.g., poor in the mid-infrared), and difficulty handling missing or incomplete training data.
We focus in this review on the empirical approach, as it is an implementation of supervised learning.

3.2.1. Galaxies

At low redshifts, the calculation of photometric redshifts for normal galaxies is quite straightforward due to the break in the typical galaxy spectrum at 4000 Å. Thus, as a galaxy is redshifted with increasing distance, its color (measured as a difference in magnitudes) changes relatively smoothly.
As a result, both template and empirical photo-z approaches obtain similar results: a root-mean-square deviation of ~0.02 in redshift, which is close to the best possible result given the intrinsic spread in the properties. This has been shown with ANNs, SVM, DT, kNN, empirical polynomial relations, numerous template-based studies, and several other methods.
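As an illustration of the empirical approach, the sketch below regresses colors against known spectroscopic redshifts with kNN, one of the methods listed above; the smooth color-redshift relation is mocked up rather than drawn from real photometry:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
n = 10000
z_spec = rng.uniform(0.0, 0.4, size=n)  # spectroscopic training redshifts
# Mock colors that vary smoothly with redshift, as the 4000 Å break
# moves through the filter set; real colors would come from survey photometry.
colors = np.column_stack([np.sin(2 * np.pi * z_spec + k) for k in range(4)])
colors += rng.normal(scale=0.02, size=colors.shape)  # photometric noise

knn = KNeighborsRegressor(n_neighbors=10).fit(colors[:8000], z_spec[:8000])
z_phot = knn.predict(colors[8000:])
print("RMS deviation:", np.sqrt(np.mean((z_phot - z_spec[8000:]) ** 2)))
```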
At higher redshifts, achieving accurate results becomes more difficult because the 4000 Å break is shifted redward of the optical, galaxies are fainter and thus spectral data are sparser, and galaxies intrinsically evolve over time. While supervised learning has been successfully used, beyond the spectral regime the obvious limitation arises that, in order to reach the limiting magnitude of the photometric portions of surveys, extrapolation would be required. In this regime, or where only small training sets are available, template-based results can be used, but without spectral information the templates themselves are extrapolated, albeit in a more physically motivated manner. It is likely that the more general hybrid approach of using empirical data to iteratively improve the templates, or a semi-supervised procedure, will ultimately provide a more elegant solution. Another issue at higher redshift is that the available number of objects can become quite small (hundreds or fewer), reintroducing the curse of dimensionality through a simple lack of objects compared to the number of measured wavebands.
Methods of dimension reduction can help mitigate this effect.
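For example, one might project many measured wavebands onto a few principal components before fitting. The sketch below is a toy illustration with synthetic fluxes, assuming PCA as the reduction step; it is not a prescription from the original text:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
n_objects, n_bands = 300, 30               # few objects, many measured bands
z = rng.uniform(1.0, 3.0, size=n_objects)  # hypothetical high redshifts
X = np.outer(z, rng.normal(size=n_bands))  # fluxes correlated with redshift
X += rng.normal(scale=0.5, size=X.shape)   # measurement noise

# Reduce 30 bands to 5 components, then fit in the lower-dimensional space.
model = make_pipeline(PCA(n_components=5), KNeighborsRegressor(n_neighbors=5))
model.fit(X[:250], z[:250])
print("RMS:", np.sqrt(np.mean((model.predict(X[250:]) - z[250:]) ** 2)))
```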