Most of the applications
in this section are by astronomers utilizing data mining algorithms.
However, several projects and studies have also been carried out by data mining
experts utilizing astronomical data because, along with other fields such as
high energy physics and medicine, astronomy has produced many large datasets
that are amenable to the approach. Examples of such projects include the Sky
Image Cataloging and Analysis System (SKICAT) [111] for catalog production
and analysis of catalogs from digitized sky surveys, in particular the scans of
the second Palomar Observatory Sky Survey; the Jet Propulsion Laboratory
Adaptive Recognition Tool (JARTool) [112], used for recognition
of volcanoes in the over 30,000 images of Venus returned by the Magellan
mission; the subsequent and more general Diamond Eye [113]; and the Lawrence
Livermore National Laboratory Sapphire project [114].


3.1. Object classification

Classification is often
an important initial step in the scientific process, as it provides a method
for organizing information in a way that can be used to make hypotheses and to
compare with models. Two useful concepts in object classification are the
completeness and the efficiency, also known as recall and precision. They are
defined in terms of true and false positives (TP and FP) and true and false
negatives (TN and FN). The completeness is the fraction of objects that are
truly of a given type that are classified as that type:

completeness = TP / (TP + FN),

and the efficiency is the fraction of objects classified as a given type that
are truly of that type:

efficiency = TP / (TP + FP).

These two quantities are
astrophysically interesting because, while one obviously wants both higher
completeness and efficiency, there is generally a tradeoff involved. The
importance of each often depends on the application; for example, an
investigation of rare objects generally requires high completeness while
allowing some contamination (lower efficiency), but statistical clustering of
cosmological objects requires high efficiency, even at the expense of
completeness.

3.1.1. Star-galaxy separation

Due to their small
physical size compared to their distance from us, almost all stars are
unresolved in photometric datasets, and thus appear as point sources. Galaxies,
however, despite being further away, generally subtend a larger angle, and thus
appear as extended sources. Other astrophysical objects, such as
quasars and supernovae, also appear as point sources. Thus, the separation of
photometric catalogs into stars and galaxies, or more generally, stars,
galaxies, and other objects, is an important problem. The sheer number of galaxies
and stars in typical surveys (of order 10^8 or above) requires
that such separation be automated.

This problem is a well-studied one, and automated approaches were employed even
before current data mining algorithms became popular, for example, during
digitization by the scanning of photographic plates by machines such as the APM
[116] and DPOSS [117]. Several data mining algorithms have been employed,
including ANN [118, 119, 120, 121, 122, 123, 124], DT [125, 126], mixture
modeling [127], and SOM [128], with most algorithms achieving over 95%
efficiency. Typically, this is done using a set of measured morphological
parameters that are derived from the survey photometry, with perhaps colors or
other information, such as the seeing, as a prior. The advantage of this data
mining approach is that all such information about each object is easily
incorporated.
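As a toy illustration of this supervised approach (not any specific published pipeline), one can classify objects by nearest neighbour in a space of morphological parameters; the feature names and values below are invented for illustration:

```python
import math

# Invented training set: each object is (concentration index, ellipticity),
# labelled by visual inspection. Stars are compact and round; galaxies extended.
train = [
    ((0.35, 0.05), 'star'),   ((0.38, 0.08), 'star'),   ((0.33, 0.03), 'star'),
    ((0.55, 0.30), 'galaxy'), ((0.60, 0.15), 'galaxy'), ((0.52, 0.40), 'galaxy'),
]

def classify(features):
    """1-nearest-neighbour vote in the morphological feature space."""
    nearest = min(train, key=lambda t: math.dist(t[0], features))
    return nearest[1]

print(classify((0.36, 0.06)))  # star
print(classify((0.58, 0.25)))  # galaxy
```

Incorporating extra information such as colors or the seeing, as mentioned above, amounts to simply appending further coordinates to each feature tuple.
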

3.1.2. Galaxy morphology

As shown in Fig. 5, galaxies come in a range of different sizes and
shapes, known collectively as morphology. The most well-known system for the
morphological classification of galaxies is the Hubble Sequence of elliptical,
spiral, barred spiral, and irregular, along with various subclasses [129, 130,
131, 132, 133, 134]. This system correlates with many physical properties known
to be important in the formation and evolution of galaxies [135, 136].

Because galaxy morphology is a complex phenomenon that correlates with the
underlying physics but is not unique to any one given process, the Hubble
sequence has endured, despite being rather subjective and based on
visible-light morphology originally derived from blue-biased photographic
plates. The Hubble sequence has been extended in various ways, and for data
mining purposes the T system [149, 150] has been extensively used. This system
maps the categorical Hubble types E, S0, Sa, Sb, Sc, Sd, and Irr onto the
numerical values -5 to 10.

One can, therefore, train
a supervised algorithm to assign T types to images for which measured
parameters are available. Such parameters can be purely morphological, or can
include other information such as color. A series of papers by Lahav and
collaborators [152, 153, 154, 155, 104, 156] do exactly this, applying ANNs to
predict the T types of galaxies at low redshift and finding accuracy equal to
that of human experts. ANNs have also been applied to higher-redshift data to
distinguish between normal and peculiar galaxies [157], and the fundamentally
topological and unsupervised SOM ANN has been used to classify galaxies from
Hubble Space Telescope images [74], where the initial distribution of classes
is not known. Likewise, ANNs have been used to obtain morphological types from
galaxy spectra [158].
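This supervised setup can be sketched in miniature. Here a simple least-squares fit stands in for the ANNs used in the cited work, mapping a single hypothetical morphological parameter (a bulge-to-total light ratio) to a numerical T type; all numbers are invented for illustration:

```python
# Invented training pairs: (bulge-to-total light ratio, expert-assigned T type).
# Bulge-dominated early types get negative T; disc-dominated late types positive.
pairs = [(0.9, -5.0), (0.7, -2.0), (0.5, 1.0), (0.3, 3.0), (0.15, 5.0), (0.05, 7.0)]

n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
         / sum((x - mean_x) ** 2 for x, _ in pairs))
intercept = mean_y - slope * mean_x

def predict_ttype(bulge_to_total):
    """Predict a numerical T type from the (hypothetical) bulge-to-total ratio."""
    return slope * bulge_to_total + intercept

print(predict_ttype(0.6))  # slightly negative T: an early-to-intermediate type
```

A real application would replace the one-parameter linear model with an ANN trained on many morphological parameters, but the workflow — fit on expert-labelled examples, then predict T types for unlabelled images — is the same.
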

3.2. Photometric redshifts

An area of astrophysics
that has greatly increased in popularity in the last few years is the
estimation of redshifts from photometric data (photo-zs). This is
because, although the distances are less accurate than those obtained with
spectra, the sheer number of objects with photometric measurements can often
make up for the reduction in individual accuracy by suppressing the statistical
noise of an ensemble calculation.
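The ensemble argument can be made concrete: if each object's photometric redshift carries an individual scatter sigma, the error on the mean of N such measurements shrinks roughly as sigma / sqrt(N). A quick simulation, with an invented per-object scatter of 0.05, illustrates this:

```python
import random
import statistics

random.seed(1)
true_z = 0.3
sigma = 0.05  # invented per-object photo-z scatter

# 10,000 noisy photo-z estimates of objects at the same true redshift.
sample = [random.gauss(true_z, sigma) for _ in range(10_000)]

error_of_mean = abs(statistics.mean(sample) - true_z)
print(error_of_mean)  # of order sigma / sqrt(10000) = 0.0005, far below 0.05
```

This is why a large photometric sample can compete statistically with a much smaller spectroscopic one, despite each individual distance being less accurate.
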

The two common approaches
to photo-zs are the template method and the empirical training set
method. The template approach has many complicating issues [250], including
calibration, zero-points, priors, multiwavelength performance (e.g., poor in
the mid-infrared), and difficulty handling missing or incomplete training
data. We focus in this review on the empirical approach, as it is an
implementation of supervised learning.

3.2.1. Galaxies

At low redshifts, the
calculation of photometric redshifts for normal galaxies is quite
straightforward due to the break in the typical galaxy spectrum at 4000 Å.
Thus, as a galaxy is redshifted with increasing distance, its color (measured
as a difference in magnitudes) changes relatively smoothly. As a result, both
template and empirical photo-z approaches obtain similar results, a
root-mean-square deviation of ~0.02 in redshift, which is close to the best
possible result given the intrinsic spread in galaxy properties [251]. This has
been shown with ANNs [33, 165, 156, 252, 253, 254, 124, 255, 256, 257, 179],
SVM [258, 259], DT [260], kNN [261], empirical polynomial relations [262, 251,
247, 263, 264, 265], numerous template-based studies, and several other
methods. At higher redshifts, obtaining accurate results becomes more difficult
because the 4000 Å break is shifted redward of the optical, galaxies are
fainter and thus spectral data are sparser, and galaxies intrinsically evolve
over time. While supervised learning has been successfully used, beyond the
spectral regime the obvious limitation arises that reaching the limiting
magnitude of the photometric portions of surveys would require extrapolation.
In this regime, or where only small training sets are available, template-based
results can be used, but without spectral information the templates themselves
are being extrapolated; however, the extrapolation of the templates is done in
a more physically motivated manner. It is likely that the more general hybrid
approach of using empirical data to iteratively improve the templates [266,
267, 268, 269, 270, 271], or the semi-supervised method described in Section
2.4.3, will ultimately provide a more elegant solution. Another issue at higher
redshift is that the available numbers of objects can become quite small (in
the hundreds or fewer), thus reintroducing the curse of dimensionality through
a simple lack of objects compared to measured wavebands. The methods of
dimension reduction (Section 2.3) can help to mitigate this effect.