MNRAS 503, 5263–5273 (2021) doi:10.1093/mnras/stab744

Classification of 4XMM-DR9 sources by machine learning


Yanxia Zhang,1* Yongheng Zhao1 and Xue-Bing Wu2,3
1 CAS Key Laboratory of Optical Astronomy, National Astronomical Observatories, Beijing 100101, China
2 Department of Astronomy, School of Physics, Peking University, Beijing 100871, China
3 Kavli Institute for Astronomy and Astrophysics, Peking University, Beijing 100871, China

Downloaded from https://academic.oup.com/mnras/article/503/4/5263/6232150 by National Astronomical Observatory user on 22 December 2023

Accepted 2021 March 9. Received 2021 February 22; in original form 2020 May 27

ABSTRACT
The ESA’s X-ray Multi-mirror Mission (XMM–Newton) created a new high-quality version of the XMM–Newton serendipitous
source catalogue, 4XMM-DR9, which provides a wealth of information for observed sources. The 4XMM-DR9 catalogue is
correlated with the Sloan Digital Sky Survey (SDSS) DR12 photometric data base and the AllWISE data base; we then get
X-ray sources with information from the X-ray, optical, and/or infrared bands and obtain the XMM–WISE, XMM–SDSS, and
XMM–WISE–SDSS samples. Based on the large spectroscopic surveys of SDSS and the Large Sky Area Multi-object Fiber
Spectroscopic Telescope (LAMOST), we cross-match the XMM–WISE–SDSS sample with sources of known spectral classes,
and obtain known samples of stars, galaxies, and quasars. The distribution of stars, galaxies, and quasars as well as all spectral
classes of stars in 2D parameter space is presented. Various machine-learning methods are applied to different samples from
different bands. The better classified results are retained. For the sample from the X-ray band, a rotation-forest classifier performs
the best. For the sample from the X-ray and infrared bands, a random-forest algorithm outperforms all other methods. For the
samples from the X-ray, optical, and/or infrared bands, the LogitBoost classifier shows its superiority. Thus, all X-ray sources
in the 4XMM-DR9 catalogue with different input patterns are classified by their respective models that are created by these best
methods. Class memberships and membership probabilities are assigned to individual X-ray sources. The classified results
will be of great value for further, more detailed research on X-ray sources.
Key words: methods: data analysis – methods: statistical – astronomical data bases: miscellaneous – catalogues – stars: general –
galaxies: general.

* E-mail: [email protected]
© 2021 The Author(s). Published by Oxford University Press on behalf of Royal Astronomical Society

1 INTRODUCTION

Since the Earth's atmosphere prevents X-rays from reaching the ground, only a space-based telescope can observe and probe celestial X-ray sources. Both NASA's Chandra X-ray Observatory and the ESA's X-ray Multi-mirror Mission (XMM–Newton) are space missions in the X-ray band and are leading X-ray astronomy into a new era (Brandt & Hasinger 2005). Significant discoveries have been made with these missions (Santos-Lleo et al. 2009). These missions may provide answers to other profound cosmic questions such as the enigmatic black holes, the formation and evolution of galaxies, dark matter, dark energy, the origins of the Universe, and so on. They are valuable tools to probe X-ray emission from various astrophysical systems. Despite these missions, a growing number of X-ray sources remain unidentified. Identification of deep X-ray survey sources is a challenging issue for several reasons (Brandt & Hasinger 2005). Large sky-survey projects (e.g. the Sloan Digital Sky Survey, SDSS; the Wide-field Infrared Survey Explorer, WISE; the Large Sky Area Multi-object Fiber Spectroscopic Telescope, LAMOST) provide multiwavelength information and spectroscopic classes of X-ray sources. Pineau et al. (2011) cross-correlated the 2XMMi catalogue with SDSS DR7 and studied the high-energy properties of various classes of X-ray sources. Machine learning can gain knowledge from known examples and create a classifier to predict unknown sources. Machine learning therefore makes it possible to classify X-ray sources using the multiwavelength and spectroscopic information of known samples. Some work has been done in this direction. For example, Broos et al. (2011) applied a naive Bayes classifier to classify X-ray sources from the Chandra Carina Complex Project. Zhang et al. (2013) ran a random-forest algorithm on the cross-matched sample between 2XMMi-DR3 and SDSS-DR8. Farrell, Murphy & Lo (2015) classified the variable 3XMM sources with the random-forest algorithm. Arnason, Barmby & Vulic (2020) identified new X-ray binary candidates in M31, also using the random-forest algorithm.

In this paper, we download the 4XMM-DR9 catalogue, and obtain the spectroscopic classes of these X-ray sources from SDSS and LAMOST, X-ray information from XMM–Newton, optical information from SDSS, and infrared information from AllWISE. We create classifiers to classify the X-ray sources with known spectroscopic classes based on only X-ray information; combined X-ray and optical/infrared information; or combined X-ray, optical, and infrared information. Section 2 describes the data used and the distribution of various objects in 2D space. Section 3 presents the classification methodologies. Section 4 compares the performance of better classifiers for different samples. Section 5 discusses the results
of the classifiers and applies the created classifiers to the unknown sources. Section 6 provides our conclusions for this work.

2 THE DATA

The European Space Agency's (ESA) X-ray Multi-mirror Mission (XMM–Newton) was launched on 1999 December 10 and observes in the X-ray, ultraviolet, and optical bands. XMM–Newton is ESA's second cornerstone of the Horizon 2000 Science Programme. It carries three high-throughput X-ray telescopes with an unprecedented effective area and an optical monitor, the first flown on an X-ray observatory. This mission has released a new high-quality version of the XMM–Newton serendipitous source catalogue, 4XMM-DR9. This catalogue includes 810 795 detections of 550 124 unique sources drawn from 11 204 XMM–Newton EPIC observations, covering 1152 deg² of the sky in the energy band 0.2–12 keV (Webb et al. 2020). For the total photon energy band 0.2–12 keV, the median flux of the catalogue detections is ∼2.3 × 10⁻¹⁴ erg cm⁻² s⁻¹; it is ∼5.3 × 10⁻¹⁵ erg cm⁻² s⁻¹ in the soft energy band (0.2–2 keV), and ∼1.2 × 10⁻¹⁴ erg cm⁻² s⁻¹ in the hard band (2–12 keV). About 23 per cent of the sources have total fluxes below 1 × 10⁻¹⁴ erg cm⁻² s⁻¹. The typical positional accuracy is about 2 arcsec. For the astrometric quality, the mean RA and Dec. offsets between the XMM sources and the SDSS optical quasars are −0.01 and 0.005 arcsec, respectively, with corresponding standard deviations of 0.70 and 0.64 arcsec (see fig. 10 in Webb et al. 2020).

The Wide-field Infrared Survey Explorer (WISE; Wright et al. 2010) is an all-sky mid-infrared survey with simultaneous photometry in four filters at 3.4, 4.6, 12, and 22 μm (W1, W2, W3, and W4). It obtained over a million images and observed hundreds of millions of celestial objects. The WISE survey provides mid-infrared information about the Solar system, the Milky Way, and the Universe. On the basis of the WISE work, the AllWISE programme has created new products with better photometric sensitivity and accuracy as well as better astrometric precision. The limiting magnitudes of W1 and W2 are brighter than 19.8 and 19.0 (Vega: 17.1, 15.7) for the AllWISE source catalogue. Sources brighter than 8 and 7 mag in the W1 and W2 bands, respectively, are affected by saturation. Considering the accuracy of W1, W2, W3, and W4, we only adopt W1 and W2, converting W1 and W2 from Vega magnitudes to AB magnitudes by W1_AB = W1 + 2.699 and W2_AB = W2 + 3.339. The average point spread function (PSF) has full widths at half-maximum (FWHMs) of 6.1, 6.4, 6.5, and 12 arcsec in W1, W2, W3, and W4, respectively. For sources with high signal-to-noise ratios (S/N > 20), the WISE positions are better than 0.15 arcsec for 1σ and one axis.

The Sloan Digital Sky Survey (SDSS; York et al. 2000) has been one of the most successful photometric and spectroscopic sky surveys ever made, providing deep multicolour images of a third of the sky and spectra for more than 3 000 000 celestial objects. Data Release 12 (DR12) is the final data release of SDSS-III, containing all SDSS observations up to 2014 July (Eisenstein et al. 2011). It includes the complete data set of the BOSS and APOGEE surveys, and now also includes stellar radial velocity measurements from MARVELS. Data Release 16 (DR16) is the fourth data release of the fourth phase of SDSS (SDSS-IV; Blanton et al. 2017). SDSS mapped the sky in the five optical band passes (ugriz) with central wavelengths of 3551, 4686, 6165, 7481, and 8931 Å. The pixel size is 0.396 arcsec and the astrometric accuracy is better than 0.1 arcsec rms absolute per coordinate. The limiting magnitudes of ugriz are 21.6, 22.2, 22.2, 21.3, and 20.7 at 95 per cent completeness, respectively. The u and z magnitudes are converted to AB magnitudes by u_AB = u − 0.04 mag and z_AB = z + 0.02 mag. DR16 contains SDSS observations up to 2018 August, including 880 652 stars, 2 616 381 galaxies, and 749 775 quasars when zWarning = 0 in the DR16 SpecObj data base.

The Large Sky Area Multi-object Fiber Spectroscopic Telescope (LAMOST; Cui et al. 2012; Luo et al. 2015) can take 4000 spectra in a single exposure to a limiting magnitude as faint as r = 19 mag at a resolution of R = 1800. It has finished the first five-year survey plan. The LAMOST survey contains the LAMOST ExtraGAlactic Survey (LEGAS) and the LAMOST Experiment for Galactic Understanding and Exploration (LEGUE) survey of Milky Way stellar structure. The data products of the fifth data release (DR5) include 8 183 160 stars (7 540 605 stars with S/N in the g or i bands greater than 10), 152 863 galaxies, 52 453 quasars, and 637 889 unknown objects.

The SDSS Data Release 14 Quasar catalogue (DR14Q; Pâris et al. 2018) contains 526 356 spectroscopically identified quasars. DR14Q consists of spectroscopically identified quasars from SDSS-I, II, III, and the latest SDSS-IV eBOSS survey.

In order to obtain multiwavelength properties of X-ray sources, we cross-match the 4XMM-DR9 catalogue with the SDSS and AllWISE data bases. Following Covey et al. (2008), we estimate spurious SDSS and AllWISE matches by applying a 30 arcsec offset to the X-ray source declinations and searching the 4XMM-DR9 catalogue for sources with SDSS counterparts within 8 arcsec and AllWISE counterparts within 10 arcsec for each X-ray source centroid. Fig. 1 shows a normalized cumulative histogram of separation between the 4XMM-DR9 and SDSS sources as well as the 4XMM-DR9 and AllWISE sources; the solid histogram represents the cumulative distribution of separation between the X-ray and optical counterparts for real XMM–SDSS sources within 8 arcsec (left-hand panel of Fig. 1) and that of separation between the X-ray and infrared counterparts for real XMM–WISE sources within 10 arcsec (right-hand panel of Fig. 1); the dashed histogram indicates the upper limit to the fractional contamination of the XMM–SDSS sample by chance superpositions of independent X-ray and optical sources (left-hand panel of Fig. 1) and the XMM–WISE sample by chance superpositions of independent X-ray and infrared sources (right-hand panel of Fig. 1). In general, high completeness and low contamination cannot be achieved at the same time: higher completeness comes at the expense of higher contamination, while pursuing low contamination sacrifices completeness. For XMM matching SDSS at 3, 4, 5, and 6 arcsec, the completeness versus contamination is respectively 71.68 per cent versus 9.75 per cent, 79.49 per cent versus 16.53 per cent, 85.55 per cent versus 24.50 per cent, and 90.78 per cent versus 33.36 per cent; for XMM matching AllWISE at 3, 6, 7, and 8 arcsec, the completeness versus contamination is respectively 56.90 per cent versus 0.89 per cent, 77.66 per cent versus 1.75 per cent, 83.39 per cent versus 2.01 per cent, and 89.14 per cent versus 2.08 per cent. From Fig. 1, the fraction of X-ray sources matching SDSS exceeds 90 per cent at 6 arcsec and that matching AllWISE is about 90 per cent at 8 arcsec. So the cross-match radius between the SDSS and 4XMM-DR9 sources is set as 6 arcsec while that between AllWISE and 4XMM-DR9 sources is adopted as 8 arcsec. We apply the software TOPCAT (Taylor 2005) to perform the cross-matching. Finally we obtain the XMM–WISE sample and the XMM–SDSS sample; the XMM–WISE–SDSS sample is then derived according to the same ID (srcid) in the XMM–WISE and XMM–SDSS samples. All photometries throughout this paper are extinction-corrected according to the work of Schindler et al. (2017) and AB magnitudes are adopted.

Figure 1. Histogram of source separation between XMM and SDSS as well as between XMM and AllWISE. Red solid histogram: cumulative distribution of separation between X-ray and optical counterparts (left) and between X-ray and infrared counterparts (right) for the real X-ray sources; blue dashed line: distribution of separations returned by matching the fake X-ray sources with coordinates shifted by 30 arcsec to the SDSS (left) and AllWISE (right) data bases.

To construct the known spectral samples, we use sources that have been identified spectroscopically by SDSS DR16 and LAMOST DR5. The known samples are cross-matched with the XMM–WISE–
SDSS sample in a 6 arcsec radius. To preserve data quality, we set zWarning = 0 in the DR16 SpecObj data base when downloading data, and sc_poserr ≤ 5 and sc_sum_flag < 3 in the 4XMM-DR9 data base; records with default values of ugriz, W1, and W2 are removed; records with W1 < 8 and W2 < 7 are deleted; and stars in the LAMOST DR5 data base are adopted only with S/N in the g or i bands greater than 10. When an object is identified by both SDSS and LAMOST, only its SDSS spectral class is retained. If the objects with known spectral class in the XMM–WISE–SDSS sample have counterparts in DR14Q, the objects are labelled as QSO. Finally, the known samples include 3558 stars, 7203 galaxies, and 21 040 quasars with information from the X-ray, optical, and infrared bands. The spectra were identified as stars, galaxies, and QSO by the SDSS and LAMOST automated classification pipelines using template fitting. Detailed information on known samples is given in Table 1. For the class assigned as galaxies, the subclass Non is from the LAMOST data base and the subclass's default value is from the SDSS data base. The LAMOST pipeline does not provide subclasses for galaxies, and all the subclasses for galaxies in the LAMOST data base are labelled as Non. The websites relating to the above data sets are shown in Table 2.

Table 1. The numbers of each class and subclass for the known samples.

Class    Subclass    No.
Galaxy   AGN         611
         AGN BL      107
         SB          387
         SB BL       8
         SF          1008
         SF BL       46
         BL          281
         Non         219
         (default)   4536
Star     O           1
         B           5
         A           79
         F           708
         G           869
         K           777
         M           1062
         CV          39
         DB          5
         EM          1
         WD          10
         sdM1        1
         Carbon      1
QSO                  21 040

Table 2. The websites for related catalogues.

4XMM-DR9 catalogue
  https://www.cosmos.esa.int/web/xmm-newton/xsa
Spectrally identified stars, galaxies, and quasars from SDSS
  http://skyserver.sdss.org/dr16/en/tools/search/sql.aspx
Spectrally identified stars, galaxies, and quasars from LAMOST
  http://dr5.lamost.org/v3/catalogue
SDSS DR14 Quasar catalogue (DR14Q)
  https://www.sdss.org/dr14/algorithms/qso_catalog

As for the definitions and abbreviations in Table 1, AGN is short for active galactic nucleus, AGN BL for broad-line AGN, SB for starburst galaxy, SB BL for broad-line SB, SF for star-forming galaxy, SF BL for broad-line SF, BL for BL Lacertae objects, CV for cataclysmic variable star, EM for emission line star, WD for white dwarf, DB for double or binary star, sdM1 for subdwarf M1 star, Carbon for carbon star, and O, B, A, F, G, K, M for stars with spectral types of O, B, A, F, G, K, M, respectively. All these subclasses are assigned by the SDSS and LAMOST automated classification pipelines depending on the spectroscopic characteristics. Bolton et al. (2012) showed that the galaxy spectra from SDSS were grouped by the line-fitting code into AGN, SF, and SB: if the spectra meet log10([O III]/Hβ) > 1.2 log10([N II]/Hα) + 0.22, the galaxy spectra were identified as AGN; otherwise, based on the equivalent width (EW) of Hα, they were identified as SF if EW(Hα) < 50 Å and SB if EW(Hα) > 50 Å; galaxies and quasars may be classified as broad-line (BL) when their line widths are larger than 200 km s⁻¹; and stellar spectra were classified as spectral types from O to M based on the ELODIE stellar library. The broad-line classification given by the SDSS pipeline does not necessarily indicate that an AGN has emission lines broad enough to classify it as a broad-line (as opposed to a narrow-line) AGN because the emission line widths are typically more than 2000 km s⁻¹ for broad-line AGNs (Hao et al. 2005). BL Lacertae objects are a subclass of AGNs that have fast and large amplitude variability over the whole
spectra, high and variable polarization, and continuous spectra with no or weak absorption and emission features. Starburst galaxies are characterized by higher rates of star formation than normal galaxies. They are either young or rejuvenated galaxies that typically contain very luminous X-ray sources. Since the separation of subclasses of galaxies depends on spectral line information, it is difficult to discriminate them without spectra.

Table 3. The parameters, definition, catalogues, and wavebands.

Parameter    Definition                                    Catalogue    Waveband
srcid        Source ID                                     XMM          X-ray band
sc_ra        Right ascension in decimal degrees            XMM          X-ray band
sc_dec       Declination in decimal degrees                XMM          X-ray band
hr1          Hardness ratio 1, hr1 = (B − A)/(B + A)       XMM          X-ray band
hr2          Hardness ratio 2, hr2 = (C − B)/(C + B)       XMM          X-ray band
hr3          Hardness ratio 3, hr3 = (D − C)/(D + C)       XMM          X-ray band
hr4          Hardness ratio 4, hr4 = (E − D)/(E + D)       XMM          X-ray band
extent       Source extent                                 XMM          X-ray band
log(fx)      X-ray flux                                    XMM          X-ray band
log(fx/fr)   X-ray-to-optical-flux ratio                   SDSS, XMM    Optical and X-ray bands
u            u magnitude                                   SDSS         Optical band
g            g magnitude                                   SDSS         Optical band
r            r magnitude                                   SDSS         Optical band
i            i magnitude                                   SDSS         Optical band
z            z magnitude                                   SDSS         Optical band
W1           W1 magnitude                                  AllWISE      Infrared band
W2           W2 magnitude                                  AllWISE      Infrared band

Note. A, B, C, D, and E are the count rates in the energy bands 0.2–0.5, 0.5–1, 1–2, 2–4.5, and 4.5–12 keV, respectively.

We select the features [log(fx), hr1, hr2, hr3, hr4, extent, r, W1, u − g, g − r, r − i, i − z, z − W1, W1 − W2, log(fx/fr)] of the known star, galaxy, and quasar samples used for this study. The selected features are described in Table 3. The 2D plots between two attributes from these features are given in Figs 2 and 3. Fig. 2 shows the differences between stars, galaxies, and quasars, while Fig. 3 indicates the differences between different spectral classes of stars. The two figures tell us that it is difficult to discriminate stars, galaxies, and quasars, and different spectral classes of stars, using only two attributes. These attributes all contribute more or less to the classification. As shown in Fig. 2, most quasars obviously have larger log(fx/fr), r, W1, and W1 − W2 values than stars. It is easy to separate stars and quasars from galaxies using the attribute extent in the X-ray band. Nevertheless some AGNs do not appear as X-ray extended if the emission is nuclear-dominated; thus they are misclassified as stars or quasars when only extent is used. We check stars and quasars with large extent in SIMBAD and NED within a 3 arcsec radius, and find that some of them are galaxies in a group of galaxies, galaxy cluster, or other kinds of objects. Most galaxies indeed have a relatively larger extent in the X-ray band. Most galaxies overlap with most quasars in their X-ray and infrared information while most galaxies overlap with most stars in their X-ray, optical, and infrared information. In order to effectively separate stars, galaxies, and quasars, it is necessary to apply all available information. As indicated in Fig. 3, CV stars have stronger X-ray emission than other stars, and most CV stars and M stars have stronger infrared emission than the remainder of the stars. They can be separated easily from the star sample in some 2D spaces. They evidently differ from the remainder of the star sample, whose members are mixed together and are thus difficult to discriminate.

3 THE METHOD

WEKA (the Waikato Environment for Knowledge Analysis; Witten & Frank 2005) is a piece of open-source software that is effectively used for various machine-learning tasks. It can be used through a graphical user interface, standard terminal applications, or a Java API. It is widely used for teaching, research, and industrial applications, and contains a plethora of built-in tools for standard machine-learning tasks. These tasks include data pre-processing, classification, regression, clustering, association rules, attribute selection, and visualization realized by different algorithms. This software makes it easy to work with large amounts of data and to run and compare various machine-learning algorithms. It has been successfully applied in astronomy (Zhao & Zhang 2008; Zhang, Zhao & Gao 2008; Zheng & Zhang 2008).

We try various classification algorithms provided by WEKA on our samples and only keep the better classification results. When running the software, we adopt the default settings throughout and use 10-fold cross-validation when training a model. In 10-fold cross-validation the data set is randomly divided into 10 parts, nine of which are used for training with the remaining part used for testing; this procedure is repeated 10 times so that each part serves once as the test set.
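The 10-fold procedure described above can be illustrated with a minimal Python sketch. This is not WEKA's implementation; the function name `k_fold_indices`, the `seed` argument, and the round-robin way of forming folds are assumptions made only for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper: shuffles the sample indices once, deals them into
    k nearly equal parts, and lets each part serve once as the test set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    # 3558 stars + 7203 galaxies + 21 040 quasars in the known samples.
    n_known = 3558 + 7203 + 21040
    for fold, (train, test) in enumerate(k_fold_indices(n_known), start=1):
        print(f"fold {fold}: {len(train)} training, {len(test)} test sources")
```

Each source is used for testing exactly once and for training nine times, so the reported metrics are computed on data the corresponding model has not seen.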

Figure 2. The distribution of stars, galaxies, and quasars in 2D space; red open circles represent galaxies, blue open diamonds represent quasars, and green crosses represent stars.

Figure 3. The distribution of different spectral classes of stars in 2D space; red filled circles for A stars, blue filled squares for B stars, purple open diamonds for F stars, green pluses for G stars, purple filled diamonds for K stars, black open squares for M stars, blue filled down triangles for CV stars, yellow crosses for double stars (DB), and light-green open up triangles for white dwarf stars (WD).

The metrics commonly used to evaluate the performance of a classifier include accuracy, precision, recall, and F-measure. For a data set, accuracy is the ratio of the total number of correct predictions to the total number of predictions, precision (also called efficiency) is the fraction of true positive predictions among all predicted positive examples, recall (also called completeness) is the fraction of true positive predictions among all actual positive examples, and F-measure is the harmonic mean of precision and recall:

$$\mathrm{accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}. \qquad (1)$$

Here TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative samples, respectively:

$$\mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad (2)$$

$$\text{F-measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}. \qquad (3)$$
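As a small worked illustration of equations (1)–(3), the following Python sketch evaluates the four metrics from raw counts. The function names and the example counts are hypothetical and chosen only to make the arithmetic easy to follow; the sketch assumes all denominators are non-zero.

```python
def accuracy(tp, tn, fp, fn):
    """Equation (1): correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Equation (2), left: true positives over all predicted positives."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Equation (2), right: true positives over all actual positives."""
    return tp / (tp + fn)


def f_measure(p, r):
    """Equation (3): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


if __name__ == "__main__":
    # Hypothetical counts for one class, not taken from the paper.
    tp, tn, fp, fn = 90, 870, 10, 30
    p, r = precision(tp, fp), recall(tp, fn)
    print(f"accuracy  = {accuracy(tp, tn, fp, fn):.3f}")
    print(f"precision = {p:.3f}")
    print(f"recall    = {r:.3f}")
    print(f"F-measure = {f_measure(p, r):.3f}")
```

With these counts the precision is 90/100 and the recall is 90/120, which shows how a classifier can be efficient for a class while still being incomplete.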

3.1 Random forest

Random forest is a supervised learning algorithm that builds a randomized decision tree in each iteration of the bagging algorithm and gives impressive results with very large ensembles (Breiman 2001). The bagging algorithm is applied to improve accuracy by reducing the variance, which makes the model more general and helps to avoid overfitting. For bagging, multiple subsets are taken from the training set. For each subset, a model created by the same algorithm is used to predict the output for the same test set. The averaged prediction is taken as the final output. To further understand how the bagging algorithm works, we assume that there are N models and a data set. This data set is split into training and test sets. Taking a sample of records from the training set, we train the first model with it. Then, taking another sample from the training set, we train the second model with it. A similar process is repeated for all N models. Based on all predictions of N models on the same test set, we adopt a model-averaging technique like weighted average, variance, or max voting to obtain the final prediction. Ensembles are a divide-and-conquer approach used to improve performance. For ensemble methods, 'weak learners' are grouped to form a 'strong learner'. Each classifier individually is a 'weak learner' (base learner) while all the classifiers taken together are a 'strong learner'. In a decision tree, the input data are separated into smaller and smaller sets from the tree root to its leaves. A random forest creates many decision trees. When classifying a new object, each decision tree provides a classification. The final class of this object is the one that receives the most votes among all the trees in the forest. This simplified random forest is shown in Fig. 4. The advantage of using random forest is that it is able to deal with unbalanced and missing data and runs relatively fast.

Figure 4. The simplified random forest.

3.2 Rotation forest

Rotation forest is a powerful tree-based ensemble method based on feature extraction and is designed to work with a smaller number of ensemble members; it focuses on building accurate and diverse classifiers (Rodriguez, Kuncheva & Alonso 2006). Feature extraction by principal component analysis (PCA) is performed on K subsets randomly split from the feature set in turn; here K is a rotation-forest parameter. All principal components are kept for each subset. The original data are handled by the principal component transformation and then used for training each base classifier. Its diversity is realized by the feature extraction carried out on each base classifier and its accuracy is ensured by the keeping of all principal components and the use of all of the data as a training sample for each base classifier. Decision trees are usually selected because they are easily influenced by rotation of the feature axes. The difference between random forest and rotation forest is that rotation forest performs PCA on the feature subset to rebuild full feature space and achieves similar or better performance with fewer trees than random forest does. For detailed information on rotation forest refer to Rodriguez et al. (2006).

3.3 LogitBoost

LogitBoost is a boosting classification algorithm, based on the logistic regression method by minimizing the logistic loss (Friedman, Hastie & Tibshirani 2000). Because noise and outliers exist in data, boosting with an exponential loss function (as in AdaBoost) is prone to issues like overfitting, which reduce the model accuracy. With the logistic loss, however, classification errors are penalized linearly instead of exponentially; thus this may improve the model accuracy and noise immunity. Here the LogitBoost classification algorithm is trained using random forests as weak learners.

4 PERFORMANCE OF THE ALGORITHMS

We classify the X-ray sources into some subclasses of galaxies, stars, and quasars, based on the input pattern of log(fx), hr1, hr2, hr3, hr4, extent, r, W1, u − g, g − r, r − i, i − z, z − W1, W1 − W2, log(fx/fr). Since the LAMOST data base does not give the subclassification of galaxies, we do not consider galaxies from LAMOST when performing multiclassification. The subclasses of AGN and AGN BL are labelled as AGN; SB and SB BL as SB; SF and SF BL as SF; and the default value as galaxies. LogitBoost is applied to the known sample without the galaxies from LAMOST by 10-fold validation. The classified result is described in Table 4. As shown in Table 4, the total accuracy reaches 90.04 per cent; the metrics of stars and quasars are 92.7 per cent or higher while those of galaxies are unsatisfactory. The subclasses of galaxies are easily confused. The subclass of default value for galaxies assigned as galaxy belongs to normal galaxies, while the subclasses of AGN, SF, SB, and BL belong to active galaxies. All metrics of normal galaxies are at least 77.0 per cent while those of active galaxies range from 7.8 per cent to 76.6 per cent. Active galaxies are likely to be classified as normal galaxies or quasars. Obviously it is very difficult to discriminate active galaxies from the whole sample. Therefore we use the known samples from LAMOST and SDSS, and only classify the sample into galaxies, stars, and quasars in the following work.
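The relabelling of galaxy subclasses used for the multiclassification can be written as a simple mapping. The sketch below is illustrative only: the dictionary `MERGE`, the key `"default"` for the SDSS default subclass, and the function name are assumptions, while the counts are copied from Table 1.

```python
from collections import Counter

# Subclass -> merged label, following the grouping stated in the text.
# "default" is a hypothetical key standing for the SDSS default subclass.
MERGE = {
    "AGN": "AGN", "AGN BL": "AGN",
    "SB": "SB", "SB BL": "SB",
    "SF": "SF", "SF BL": "SF",
    "BL": "BL",
    "default": "Galaxy",
}

# Galaxy subclass counts from Table 1.
TABLE1_GALAXIES = {
    "AGN": 611, "AGN BL": 107, "SB": 387, "SB BL": 8,
    "SF": 1008, "SF BL": 46, "BL": 281, "Non": 219, "default": 4536,
}


def merge_galaxy_subclasses(counts):
    """Merge broad-line variants and drop LAMOST galaxies (subclass 'Non')."""
    merged = Counter()
    for subclass, n in counts.items():
        if subclass == "Non":  # LAMOST provides no galaxy subclass
            continue
        merged[MERGE[subclass]] += n
    return dict(merged)


if __name__ == "__main__":
    for label, n in merge_galaxy_subclasses(TABLE1_GALAXIES).items():
        print(f"{label:7s} {n}")
```

The merged counts (718 AGN, 395 SB, 1054 SF, 281 BL, and 4536 normal galaxies) agree with the row totals of the confusion matrix in Table 4.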

Table 4. The performance of LogitBoost for multiclassification.

Known↓ Classified→   AGN   BL   SB    SF    Galaxy   QSO      Star   Precision   Recall   F-measure
AGN                  149   8    7     98    269      177      10     50.0%       20.8%    29.3%
BL                   12    22   1     9     193      34       10     44.9%       7.8%     13.3%
SB                   4     0    141   68    23       141      18     76.6%       35.7%    48.7%
SF                   53    3    28    472   251      205      42     58.9%       44.8%    50.9%
Galaxy               45    14   4     100   3698     605      70     77.0%       81.5%    79.2%
QSO                  35    0    0     50    268      20 657   30     94.0%       98.2%    96.1%
Star                 0     2    3     4     103      148      3298   94.8%       92.7%    93.7%
Total accuracy 90.04%
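The total accuracy in Table 4, and the tendency of active galaxies to be classified as normal galaxies or quasars, can be checked directly from the confusion matrix. The following Python sketch copies the counts from Table 4; the helper names `total_accuracy` and `leak_fraction` are hypothetical.

```python
LABELS = ["AGN", "BL", "SB", "SF", "Galaxy", "QSO", "Star"]

# Rows: known class; columns: classified class (counts copied from Table 4).
MATRIX = [
    [149, 8, 7, 98, 269, 177, 10],
    [12, 22, 1, 9, 193, 34, 10],
    [4, 0, 141, 68, 23, 141, 18],
    [53, 3, 28, 472, 251, 205, 42],
    [45, 14, 4, 100, 3698, 605, 70],
    [35, 0, 0, 50, 268, 20657, 30],
    [0, 2, 3, 4, 103, 148, 3298],
]


def total_accuracy(matrix):
    """Sum of the diagonal divided by the sum of all entries."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def leak_fraction(matrix, labels, known, targets):
    """Fraction of sources of class `known` classified as any class in `targets`."""
    row = matrix[labels.index(known)]
    return sum(row[labels.index(t)] for t in targets) / sum(row)


if __name__ == "__main__":
    print(f"total accuracy: {100 * total_accuracy(MATRIX):.2f} per cent")
    for cls in ("AGN", "BL", "SB", "SF"):
        frac = leak_fraction(MATRIX, LABELS, cls, ("Galaxy", "QSO"))
        print(f"{cls}: {100 * frac:.1f} per cent classified as Galaxy or QSO")
```

For example, 269 + 177 of the 718 known AGN fall in the Galaxy or QSO columns, which is consistent with the low recall reported for the active-galaxy subclasses.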

For the sample from the X-ray band only, the classification performance of random forest and rotation forest is shown in Table 5. The input pattern for this sample is log(f_x), hr1, hr2, hr3, hr4, extent. As shown in Table 5, comparing rotation forest with random forest, recall and F-measure decrease for galaxies only, while for stars and quasars all metrics increase. Rotation forest outperforms random forest in terms of accuracy (77.80 per cent versus 77.46 per cent). With information from the X-ray band only, the classification metrics of quasars are satisfactory, while those of galaxies and stars are poor in terms of precision, recall, and F-measure.

For the sample from the X-ray and optical bands, the classification performance of random forest and LogitBoost is given in Table 6. The input pattern is log(f_x), hr1, hr2, hr3, hr4, extent, r, u − g, g − r, r − i, i − z, log(f_x/f_r). As indicated in Table 6, all metrics for LogitBoost are better than those for random forest, and all of them are higher than 84.8 per cent. For quasars and stars alone, the metrics are above 87.5 per cent. LogitBoost is superior to random forest in this case, with an accuracy of 92.82 per cent.

For the sample from the X-ray and infrared bands, the classification performance of random forest and LogitBoost is given in Table 7. The input pattern is log(f_x), hr1, hr2, hr3, hr4, extent, W1, W1 − W2. As shown in Table 7, random forest performs slightly better than LogitBoost in terms of total accuracy, and all metrics for random forest are close to those of LogitBoost. The accuracy for galaxies is still worse than that for quasars and stars. Nevertheless, all metrics are better than 76.1 per cent. The total accuracy of random forest is 89.42 per cent.

For the sample from the X-ray, optical, and infrared bands, the classification performance of random forest and LogitBoost is listed in Table 8. The input pattern is log(f_x), hr1, hr2, hr3, hr4, extent, r, W1, u − g, g − r, r − i, i − z, z − W1, W1 − W2, log(f_x/f_r). As shown in Table 8, even for galaxies the metrics are greater than 87.1 per cent; for stars, the metrics are above 90.7 per cent; for quasars, the metrics are higher than 95.4 per cent. All LogitBoost metrics except the precision for stars are greater than those of random forest. Compared to random forest, LogitBoost has a slight advantage, and its total accuracy reaches 94.26 per cent.

To check how observational errors influence the performance of a classifier, we take the XMM–SDSS sample as an example. Setting σ_u < 0.3, σ_g < 0.3, σ_r < 0.3, σ_i < 0.3, and σ_z < 0.3, the known sample size changes from 31 800 to 26 428; the performance of random forest and LogitBoost is shown in Table 9. Comparing the result in Table 9 with that in Table 6, the performances of random forest and LogitBoost both improve with higher-quality data in terms of accuracy (94.73 per cent versus 92.57 per cent for random forest, 94.93 per cent versus 92.82 per cent for LogitBoost). Although higher-quality data lead to higher performance of a classifier, the number of sources with X-ray emission is inherently small, so we do not set a magnitude error limitation on the samples in our work.

5 DISCUSSION AND APPLICATION

Comparing Tables 5–8, the worst result belongs to the sample from the X-ray band only, as expected. Adding the information from the optical and/or infrared bands increases the classification accuracy for any classifier; the accuracy with the X-ray and optical bands is better than that with the X-ray and infrared bands. The best performance is obtained with all information from the X-ray, optical, and infrared bands. There is no algorithm that shows the best performance for every data set. For the sample from the X-ray band, the rotation-forest classifier is the best; for the sample from the X-ray and infrared bands, random forest is superior to all other algorithms; and for the other two samples, LogitBoost shows its superiority.

In reality, some X-ray sources have information from the X-ray, optical, and infrared bands, some have information from the X-ray and infrared bands, some have information from the X-ray and optical bands, and some even have X-ray information only. Based on the known samples with spectral classes, we need to construct four classifiers for the four situations to predict the classes of the unknown X-ray sources. For the sources with X-ray information only, a rotation-forest classifier is built with the known samples with spectral classes to predict their classes and probabilities. For the sources with X-ray and infrared bands, a random-forest classifier is created with the known samples with spectral classes to predict their classes and probabilities. For the sources from the X-ray and optical bands or from the X-ray, optical, and infrared bands, LogitBoost classifiers are constructed with the corresponding known samples with spectral classes to predict their classes and probabilities, respectively. For the 4XMM-DR9 sources, all predicted results are shown in Table 10. The information gained will be of great value for further research into the characteristics and physics of X-ray sources.

6 CONCLUSIONS

Based on the distribution of stars, galaxies, and quasars in 2D space, it is difficult to discriminate them and their subclasses clearly. Similarly, given the distribution of all spectral classes of stars in 2D space, it is also not easy to separate them, but CV stars and M stars stand out clearly in some 2D spaces. Of the entire X-ray sample, quasars make up the majority while stars and galaxies account for only a minority. With X-ray information and spectral classes of known X-ray sources, we create a rotation-forest classifier to

Table 5. The performance of random forest and rotation forest with log(f_x), hr1, hr2, hr3, hr4, extent.

                 Random forest                     Rotation forest
Class            Precision  Recall  F-measure      Precision  Recall  F-measure
QSO              82.1%      93.4%   87.4%          81.4%      94.9%   87.6%
Galaxy           63.0%      43.4%   51.4%          66.0%      40.4%   50.1%
Star             64.0%      52.4%   57.6%          65.1%      52.7%   58.3%
Total accuracy   77.46%                            77.80%

Table 6. The performance of random forest and LogitBoost with log(f_x), hr1, hr2, hr3, hr4, extent, r, u − g, g − r, r − i, i − z, log(f_x/f_r).

                 Random forest                     LogitBoost
Class            Precision  Recall  F-measure      Precision  Recall  F-measure
QSO              94.4%      96.1%   95.2%          94.5%      96.2%   95.4%
Galaxy           85.6%      84.8%   85.2%          86.1%      85.3%   85.7%
Star             95.9%      87.5%   91.5%          96.2%      88.1%   92.0%
Total accuracy   92.57%                            92.82%

Table 7. The performance of random forest and LogitBoost with log(f_x), hr1, hr2, hr3, hr4, extent, W1, W1 − W2.

                 Random forest                     LogitBoost
Class            Precision  Recall  F-measure      Precision  Recall  F-measure
QSO              93.4%      95.9%   94.6%          93.4%      96.8%   95.9%
Galaxy           79.1%      76.1%   77.5%          78.9%      76.1%   76.1%
Star             85.1%      78.2%   81.5%          85.1%      82.3%   77.9%
Total accuracy   89.42%                            89.38%

Table 8. The performance of random forest and LogitBoost with log(f_x), hr1, hr2, hr3, hr4, extent, r, W1, u − g, g − r, r − i, i − z, z − W1, W1 − W2, log(f_x/f_r).

                 Random forest                     LogitBoost
Class            Precision  Recall  F-measure      Precision  Recall  F-measure
QSO              95.4%      97.0%   96.2%          95.6%      97.1%   96.3%
Galaxy           88.4%      87.1%   87.7%          89.1%      87.5%   88.3%
Star             96.9%      90.7%   93.7%          96.8%      91.2%   93.9%
Total accuracy   94.03%                            94.26%
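The input patterns named in the captions of Tables 5–8 can be assembled from a single source record along the following lines. This is a minimal sketch under stated assumptions: the record keys (`fx`, `u`, `W1`, `log_fx_fr`, and so on) and the labels X, XO, XI, XIO are hypothetical, log is taken to be base 10, and log(f_x/f_r) is treated as a precomputed input because its exact definition is not given in this section.

```python
import math

# X-ray features shared by all four input patterns.
X_RAY = ["log_fx", "hr1", "hr2", "hr3", "hr4", "extent"]

# Feature lists following the captions of Tables 5-8.
# The pattern labels (X, XO, XI, XIO) are illustrative names only.
PATTERNS = {
    "X": X_RAY,
    "XO": X_RAY + ["r", "u-g", "g-r", "r-i", "i-z", "log_fx_fr"],
    "XI": X_RAY + ["W1", "W1-W2"],
    "XIO": X_RAY + ["r", "W1", "u-g", "g-r", "r-i", "i-z",
                    "z-W1", "W1-W2", "log_fx_fr"],
}


def feature_value(record, name):
    """Compute one feature from a record of raw measurements."""
    if name == "log_fx":
        return math.log10(record["fx"])  # assumes base-10 logarithm
    if "-" in name:
        # A colour is the difference of two magnitudes, e.g. "u-g".
        first, second = name.split("-")
        return record[first] - record[second]
    return record[name]


def build_features(record, pattern):
    """Return the feature vector for the chosen input pattern."""
    return [feature_value(record, name) for name in PATTERNS[pattern]]


# Made-up values for illustration only; not taken from the catalogue.
EXAMPLE = {
    "fx": 1e-13, "hr1": 0.5, "hr2": 0.25, "hr3": -0.25, "hr4": -0.5,
    "extent": 0.0, "u": 20.5, "g": 20.0, "r": 19.5, "i": 19.25,
    "z": 19.0, "W1": 15.0, "W2": 14.0, "log_fx_fr": -0.5,
}

if __name__ == "__main__":
    for pattern in PATTERNS:
        print(pattern, build_features(EXAMPLE, pattern))
```

The four patterns contain 6, 12, 8, and 15 features respectively, which matches the feature lists in the table captions.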

Table 9. The performance of random forest and LogitBoost with log(f_x), hr1, hr2, hr3, hr4, extent, r, u − g, g − r, r − i, i − z, log(f_x/f_r) when σ_u < 0.3, σ_g < 0.3, σ_r < 0.3, σ_i < 0.3, and σ_z < 0.3.

                 Random forest                     LogitBoost
Class            Precision  Recall  F-measure      Precision  Recall  F-measure
QSO              95.7%      98.1%   96.9%          95.9%      98.1%   97.0%
Galaxy           88.8%      83.7%   86.2%          89.3%      84.1%   86.6%
Star             96.4%      89.8%   93.0%          96.5%      90.9%   93.6%
Total accuracy   94.73%                            94.93%
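The magnitude-error selection behind Table 9 amounts to a simple filter on the five SDSS magnitude errors. The sketch below is illustrative only: the key names (`err_u` ... `err_z`) and the three example sources are hypothetical, and the strict inequality follows the condition σ < 0.3 stated in the caption.

```python
OPTICAL_BANDS = ("u", "g", "r", "i", "z")


def passes_error_cut(source, limit=0.3):
    """True if every SDSS magnitude error is strictly below the limit."""
    return all(source["err_" + band] < limit for band in OPTICAL_BANDS)


def apply_error_cut(sources, limit=0.3):
    """Keep only the sources that pass the cut in all five bands."""
    return [s for s in sources if passes_error_cut(s, limit)]


# Made-up sources for illustration; the key names are assumptions.
SOURCES = [
    {"id": "a", "err_u": 0.10, "err_g": 0.05, "err_r": 0.04, "err_i": 0.05, "err_z": 0.12},
    {"id": "b", "err_u": 0.45, "err_g": 0.05, "err_r": 0.04, "err_i": 0.05, "err_z": 0.12},
    {"id": "c", "err_u": 0.10, "err_g": 0.05, "err_r": 0.04, "err_i": 0.05, "err_z": 0.30},
]

# Sample sizes quoted in the text: 31 800 before the cut, 26 428 after it.
RETAINED_FRACTION = 26428 / 31800

if __name__ == "__main__":
    print([s["id"] for s in apply_error_cut(SOURCES)])
    print(f"retained fraction: {100 * RETAINED_FRACTION:.1f} per cent")
```

With the sample sizes quoted in the text, the cut retains about 83.1 per cent of the known XMM–SDSS sample.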

assign classification results and their probabilities for all 4XMM-DR9 sources. Based on information from the X-ray and infrared bands as well as spectral classes of known X-ray sources, a random-forest classifier is used to discriminate X-ray sources. By means of properties from the X-ray, optical, and/or infrared bands and spectral classes of known X-ray sources, we build LogitBoost classifiers to predict X-ray sources. The predicted results from different methods with different input properties are listed in full in a table, which may be used for further study of the X-ray properties of various kinds of objects in detail.
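The choice of classifier for each of the four situations can be summarised as a small lookup. The sketch below only restates the pairing described in the text, with the total accuracies of the chosen classifiers from Tables 5–8. The function name, the boolean flags, and the label used for the X-ray-only case are illustrative assumptions.

```python
def choose_classifier(has_optical, has_infrared):
    """Return (sample, classifier, reported total accuracy in per cent)."""
    if has_optical and has_infrared:
        return ("XMM-WISE-SDSS", "LogitBoost", 94.26)
    if has_optical:
        return ("XMM-SDSS", "LogitBoost", 92.82)
    if has_infrared:
        return ("XMM-WISE", "random forest", 89.42)
    # Only X-ray information is available.
    return ("X-ray only", "rotation forest", 77.80)


if __name__ == "__main__":
    for optical in (True, False):
        for infrared in (True, False):
            print(optical, infrared, choose_classifier(optical, infrared))
```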

Table 10. Classification of 4XMM-DR9 sources.

srcid sc_ra sc_dec Class x Px Class xo Pxo Class xi Pxi Class xio Pxio

200001101010001 64.9255899382624 55.9993455276706 galaxy 0.718 star 1.0


200001101010002 64.9714038006107 55.8049026564271 galaxy 0.427 QSO 0.996 galaxy 0.894 QSO 0.998
200001101010003 65.0767247976311 55.9307646652894 galaxy 0.456 QSO 0.963 galaxy 1.0 star 0.722
200001101010004 65.1112285547752 55.9955363739078 galaxy 0.746 galaxy 1.0 galaxy 0.993 galaxy 1.0

200001101010005 64.996228987918 56.2248168838265 star 0.653 star 1.0 star 1.0 star 1.0
200001101010006 64.9348515102436 55.9291776566485 galaxy 0.506
200001101010007 64.8232313435949 55.9849189955416 galaxy 0.485 QSO 1.0 galaxy 1.0 QSO 0.999
200001101010008 65.0734121719342 55.9823011754657 QSO 0.491 star 1.0 star 0.939 star 1.0
200001101010009 65.0167233356568 55.9421102139164 QSO 0.537 QSO 0.997
200001101010010 64.9101008917805 56.0710218248335 QSO 0.502
200001101010011 64.9050705152553 56.0644078750126 QSO 0.539 galaxy 0.955 galaxy 0.999 galaxy 0.943
200001101010012 65.2336693528087 55.8993466831422 galaxy 0.479 star 0.986
200001101010013 64.8914247247132 55.9585111145714 QSO 0.523
200001101010014 64.6507013015166 56.0418886508129 QSO 0.517 star 0.999 galaxy 0.986 star 1.0
200001101010015 64.7925428495702 55.896051166999 star 0.572 star 1.0 star 1.0 star 1.0
200001101010016 65.1527613793266 55.9300031359814 QSO 0.489 galaxy 0.999
200001101010017 65.0424438892887 56.1513807784794 QSO 0.488 QSO 1.0
200001101010018 64.725085117142 55.891398599223 galaxy 0.579 star 1.0 galaxy 1.0 star 0.996
200001101010019 65.1553866221519 55.8977634034868 galaxy 0.452 galaxy 1.0 star 0.654 star 0.702
200001101010020 64.9468670845107 55.9626521430764 galaxy 0.657 galaxy 0.981 galaxy 0.998 galaxy 0.98
Notes. Class x and Px are the predicted class and its probability from the X-ray band; Class xo and Pxo are those from the X-ray and optical bands; Class xi and Pxi are those from the X-ray and infrared bands; Class xio and Pxio are those from the X-ray, infrared, and optical bands.
The whole table is available at http://paperdata.china-vo.org/zyx/table10.csv. Part of it is shown here to demonstrate its form and content.
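A reader of Table 10 may want a single label per source. One possible rule, suggested by the accuracies in Tables 5–8 but not prescribed by the paper, is to prefer the prediction that uses the most bands: xio first, then xo, then xi, then x. The sketch below assumes hypothetical CSV column names (`class_x`, `p_x`, ...); the actual header of table10.csv may differ. The two sample rows are the entries for srcid 200001101010002 and 200001101010006 shown above, with the single prediction of the latter taken to be the X-ray-only one.

```python
import csv
import io

# Column names are assumptions; check the real header of table10.csv.
SAMPLE_CSV = (
    "srcid,class_x,p_x,class_xo,p_xo,class_xi,p_xi,class_xio,p_xio\n"
    "200001101010002,galaxy,0.427,QSO,0.996,galaxy,0.894,QSO,0.998\n"
    "200001101010006,galaxy,0.506,,,,,,\n"
)

# Preference order: predictions that use more bands come first.
PREFERENCE = ("xio", "xo", "xi", "x")


def best_classification(row):
    """Return (suffix, label, probability) of the preferred available prediction."""
    for suffix in PREFERENCE:
        label = row.get("class_" + suffix, "")
        prob = row.get("p_" + suffix, "")
        if label and prob:
            return suffix, label, float(prob)
    return None  # no prediction available in this row


if __name__ == "__main__":
    for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
        print(row["srcid"], best_classification(row))
```

Other rules are equally defensible, for example taking the prediction with the highest probability, so the preference order here should be read as one option rather than a recommendation from the paper.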

ACKNOWLEDGEMENTS

We are very grateful to the referees for their constructive suggestions. This paper is funded by the National Natural Science Foundation of China under grants No. 11873066 and No. U1731109. This research has made use of data obtained from the 4XMM XMM–Newton serendipitous source catalogue compiled by the 10 institutes of the XMM–Newton Survey Science Centre selected by ESA. This publication makes use of data products from the Wide-field Infrared Survey Explorer, which is a joint project of the University of California, Los Angeles, and the Jet Propulsion Laboratory/California Institute of Technology, funded by the National Aeronautics and Space Administration. The Guoshoujing Telescope (the Large Sky Area Multi-object Fiber Spectroscopic Telescope, LAMOST) is a National Major Scientific Project built by the Chinese Academy of Sciences. Funding for the project has been provided by the National Development and Reform Commission. LAMOST is operated and managed by the National Astronomical Observatories, Chinese Academy of Sciences.

We acknowledge the SDSS data bases. Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the US Department of Energy Office of Science, and the Participating Institutions. SDSS-IV acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. The SDSS web site is www.sdss.org. SDSS-IV is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, the Chilean Participation Group, the French Participation Group, the Harvard-Smithsonian Center for Astrophysics, Instituto de Astrofísica de Canarias, the Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe (IPMU)/University of Tokyo, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Astrophysik (MPA Garching), Max-Planck-Institut für Extraterrestrische Physik (MPE), National Astronomical Observatories of China, New Mexico State University, New York University, University of Notre Dame, Observatário Nacional/MCTI, the Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, the United Kingdom Participation Group, Universidad Nacional Autónoma de México, University of Arizona, University of Colorado Boulder, University of Oxford, University of Portsmouth, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Vanderbilt University, and Yale University.

DATA AVAILABILITY

The predicted 4XMM-DR9 catalogue is available in a repository and can be accessed using a unique identifier; part of it is shown in Table 10. It is available on paperdata at http://paperdata.china-vo.org, and can be accessed at http://paperdata.china-vo.org/zyx/table10.csv.

SUPPORTING INFORMATION

Supplementary data are available at MNRAS online.

Table 10. Classification of 4XMM-DR9 sources.

Please note: Oxford University Press is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.

This paper has been typeset from a TEX/LATEX file prepared by the author.
