Stab 744
1093/mnras/stab744
Accepted 2021 March 9. Received 2021 February 22; in original form 2020 May 27
ABSTRACT
The ESA’s X-ray Multi-mirror Mission (XMM–Newton) created a new high-quality version of the XMM–Newton serendipitous
source catalogue, 4XMM-DR9, which provides a wealth of information for observed sources. The 4XMM-DR9 catalogue is
correlated with the Sloan Digital Sky Survey (SDSS) DR12 photometric data base and the AllWISE data base; we then get
X-ray sources with information from the X-ray, optical, and/or infrared bands and obtain the XMM–WISE, XMM–SDSS, and
XMM–WISE–SDSS samples. Based on the large spectroscopic surveys of SDSS and the Large Sky Area Multi-object Fiber
Spectroscopic Telescope (LAMOST), we cross-match the XMM–WISE–SDSS sample with sources of known spectral classes,
and obtain known samples of stars, galaxies, and quasars. The distribution of stars, galaxies, and quasars as well as all spectral
classes of stars in 2D parameter space is presented. Various machine-learning methods are applied to different samples from
different bands, and only the best-performing results are retained. For the sample from the X-ray band, a rotation-forest classifier performs
the best. For the sample from the X-ray and infrared bands, a random-forest algorithm outperforms all other methods. For the
samples from the X-ray and optical bands, with or without the infrared band, the LogitBoost classifier performs best. Thus, all X-ray sources
in the 4XMM-DR9 catalogue are classified, according to their available input patterns, by the models built with these best
methods. Class memberships and membership probabilities are assigned to individual X-ray sources. The classified result
will be of great value for more detailed research on X-ray sources.
Key words: methods: data analysis – methods: statistical – astronomical data bases: miscellaneous – catalogues – stars: general –
galaxies: general.
© 2021 The Author(s)
SDSS sample in a 6 arcsec radius. To ensure data quality, zWarning = 0 is set in the DR16 SpecObj data base when downloading data; sc_poserr ≤ 5 and sc_sum_flag < 3 are set in the 4XMM-DR9 data base; records with default values of ugriz, W1, and W2 are removed; records with W1 < 8 and W2 < 7 are deleted; and stars in the LAMOST DR5 data base are adopted with S/N in the g or i bands greater than 10. When an object is identified by both SDSS and LAMOST, only the SDSS spectral class is retained. If the objects with known spectral class in the XMM–WISE–SDSS sample have counterparts in DR14Q, the objects are labelled as QSO. Finally, the known samples include 3558 stars, 7203 galaxies, and 21 040 quasars with information from the X-ray, optical, and infrared bands. The spectra were identified as stars, galaxies, and QSO by the SDSS and LAMOST automated classification pipelines using template fitting. Detailed information on known samples is given in Table 1. For the class assigned as galaxies, the subclass Non is from the LAMOST data base and the subclass's default value is from the SDSS data base. The LAMOST pipeline does not provide subclasses for galaxies, and all the subclasses for galaxies in the LAMOST data base are labelled as Non. The websites relating to the above data sets are shown in Table 2. As for the definitions and abbreviations in Table 1, AGN is short for active galactic nucleus, AGN BL for broad-line AGN, SB for starburst galaxy, SB BL for broad-line SB, SF for star-forming galaxy, SF BL for broad-line SF, BL for BL Lacertae objects, CV for cataclysmic variable star, EM for emission line star, WD for white dwarf, DB for double or binary star, sdM1 for subdwarf M1 star, Carbon for carbon star, and O, B, A, F, G, K, M for stars with spectral types of O, B, A, F, G, K, M, respectively. All these subclasses are assigned by the SDSS and LAMOST automated classification pipelines depending on the spectroscopic characteristics. Bolton et al. (2012) showed that the galaxy spectra from SDSS were grouped by the line-fitting code into AGN, SF, and SB: if the spectra meet log10([O III]/H β) > 1.2 log10([N II]/H α) + 0.22, the galaxy spectra were identified as AGN; otherwise, based on the equivalent width (EW) of H α, as SF if EW(H α) < 50 Å and as SB if EW(H α) > 50 Å. Galaxies and quasars may be classified as broad-line (BL) when their line widths are larger than 200 km s−1, and stellar spectra were classified as spectral types from O to M based on the ELODIE stellar library. The broad-line classification given by the SDSS pipeline does not necessarily indicate that an AGN has emission lines broad enough to classify it as a broad-line (as opposed to a narrow-line) AGN because the emission line widths are typically more than 2000 km s−1 for broad-line AGNs (Hao et al. 2005). BL Lacertae objects are a subclass of AGNs that have fast and large amplitude variability over the whole spectra, high and variable polarization, and continuous spectra with no or weak absorption and emission features. Starburst galaxies are characterized by higher rates of star formation than normal galaxies. They are either young or rejuvenated galaxies that typically contain very luminous X-ray sources. Since the separation of subclasses of galaxies depends on spectral line information, it is difficult to discriminate them without spectra.

Table 1. The numbers of each class and subclass for the known samples.

Class     Subclass    No.
Galaxy    AGN         611
          AGN BL      107
          SB          387
          SB BL       8
          SF          1008
          SF BL       46
          BL          281
          Non         219
          (default)   4536
Star      O           1
          B           5
          A           79
          F           708
          G           869
          K           777
          M           1062
          CV          39
          DB          5
          EM          1
          WD          10
          sdM1        1
          Carbon      1
QSO                   21 040

Table 2. The websites for related catalogues.

4XMM-DR9 catalogue
https://www.cosmos.esa.int/web/xmm-newton/xsa
Spectrally identified stars, galaxies, and quasars from SDSS
http://skyserver.sdss.org/dr16/en/tools/search/sql.aspx
Spectrally identified stars, galaxies, and quasars from LAMOST
http://dr5.lamost.org/v3/catalogue
SDSS DR14 Quasar catalogue (DR14Q)
https://www.sdss.org/dr14/algorithms/qso_catalog

We select the features [log(fx), hr1, hr2, hr3, hr4, extent, r, W1, u − g, g − r, r − i, i − z, z − W1, W1 − W2, log(fx/fr)] of the known star, galaxy, and quasar samples used for this study. The selected features are described in Table 3. The 2D plots between two attributes from these features are given in Figs 2 and 3. Fig. 2 shows the differences between stars, galaxies, and quasars, while Fig. 3 indicates the differences between different spectral classes of stars. The two figures tell us that it is difficult to discriminate stars, galaxies, and quasars, or different spectral classes of stars, using only two attributes. Each of these attributes contributes to some degree to the classification. As shown in Fig. 2, most quasars clearly have larger log(fx/fr), r, W1, and W1 − W2 values than stars. It is easy to separate galaxies from stars and quasars using the attribute extent in the X-ray band. Nevertheless, some AGNs do not appear as X-ray extended if the emission is nuclear-dominated; they would thus be misclassified as stars or quasars if extent alone were used. We check stars and quasars with large extent in SIMBAD and NED within a 3 arcsec radius, and find that some of them are galaxies in a galaxy group or cluster, or other kinds of objects. Most galaxies indeed have a relatively larger extent in the X-ray band. Most galaxies overlap with most quasars in their X-ray and infrared information while most galaxies overlap with most stars in their X-ray, optical, and infrared information. In order to effectively separate stars, galaxies, and quasars, it is necessary to apply all available information. As indicated in Fig. 3, CV stars have stronger X-ray emission than other stars, and most CV stars and M stars have stronger infrared emission than the remainder of the stars. They can be separated easily from the star sample in some 2D spaces. The remaining stars, by contrast, show no obvious differences from one another: they are mixed together and are thus difficult to discriminate.

3 THE METHOD

WEKA (the Waikato Environment for Knowledge Analysis; Witten & Frank 2005) is a piece of open-source software that is effectively used for various machine-learning tasks. It can be used through a graphical user interface, standard terminal applications, or a JAVA API. It is widely used for teaching, research, and industrial applications, and contains a plethora of built-in tools for standard machine-learning tasks. These tasks include data pre-processing, classification, regression, clustering, association rules, attribute selection, and visualization realized by different algorithms. This software makes it easy to work with large amounts of data and to run and compare various machine-learning algorithms. It has been successfully applied in astronomy (Zhao & Zhang 2008; Zhang, Zhao & Gao 2008; Zheng & Zhang 2008).

We try various classification algorithms provided by WEKA on our samples and only keep the better classification results. When running the software, we adopt the default settings throughout and use 10-fold cross-validation while training a model. In 10-fold cross-validation, the data set is randomly divided into 10 parts, nine of which are used for training while the remaining part is used for testing; this procedure is repeated 10 times.
Figure 2. The distribution of stars, galaxies, and quasars in 2D space; red open circles represent galaxies, blue open diamonds represent quasars, and green
crosses represent stars.
Figure 3. The distribution of different spectral classes of stars in 2D space; red filled circles for A stars, blue filled squares for B stars, purple open diamonds for F stars, green pluses for G stars, purple filled diamonds for K stars, black open squares for M stars, blue filled down triangles for CV stars, yellow crosses for double stars (DB), and light-green open up triangles for white dwarf stars (WD).

The metrics commonly used to evaluate the performance of a classifier include accuracy, precision, recall, and F-measure. For a data set, accuracy is the ratio of the total number of correct predictions to the total number of predictions, precision (also called efficiency) is the fraction of true positive predictions among all predicted positive examples, recall (also called completeness) is the fraction of true positive predictions among all actual positive examples, and F-measure is the harmonic mean of precision and recall. With TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives,

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN},$$

$$F\text{-measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}. \qquad (3)$$
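As a worked illustration of these metrics, the sketch below computes accuracy and the per-class precision, recall, and F-measure using the standard confusion-matrix definitions, precision = TP/(TP + FP) and recall = TP/(TP + FN). The function names and the toy labels are assumptions made for this example and are not taken from the catalogue.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def class_metrics(true_labels, predicted_labels, positive):
    """Return (precision, recall, F-measure) for one class treated as positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy labels (invented for illustration only).
truth = ["QSO", "QSO", "QSO", "QSO", "GALAXY", "GALAXY", "STAR", "STAR"]
pred = ["QSO", "QSO", "QSO", "GALAXY", "GALAXY", "QSO", "STAR", "STAR"]

acc = accuracy(truth, pred)
qso = class_metrics(truth, pred, "QSO")
galaxy = class_metrics(truth, pred, "GALAXY")
star = class_metrics(truth, pred, "STAR")
```

In this toy case one quasar is predicted as a galaxy and one galaxy as a quasar, so the quasar class has TP = 3, FP = 1, and FN = 1.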
For the sample from the X-ray band only, the classification performance of random forest and rotation forest is shown in Table 5. The input pattern for this sample is log(fx), hr1, hr2, hr3, hr4, extent. As shown in Table 5, when rotation forest is compared with random forest, recall and F-measure decrease for galaxies only, whereas all metrics increase for stars and quasars. Rotation forest outperforms random forest in terms of accuracy (77.80 per cent versus 77.46 per cent). With information from the X-ray band only, the classification metrics of quasars are satisfactory while those of galaxies and stars are not good when considering precision, recall, and F-measure.

For the sample from the X-ray and optical bands, the classification performance of random forest and LogitBoost is indicated in Table 6. The input pattern is log(fx), hr1, hr2, hr3, hr4, extent, r, u − g, g − r, r − i, i − z, log(fx/fr). As indicated in Table 6, all metrics for LogitBoost are better than those for random forest, and all of them are higher than 84.8 per cent. For quasars and stars alone, the metrics are above 87.5 per cent. LogitBoost is superior to random forest for this case, as its accuracy amounts to 92.82 per cent.

For the sample from the X-ray and infrared bands, the classification performance of random forest and LogitBoost is described in Table 7. The input pattern is log(fx), hr1, hr2, hr3, hr4, extent, W1, W1 − W2. As depicted in Table 7, the performance of random forest is a little better than that of LogitBoost in terms of total accuracy. All metrics for random forest are close to those of LogitBoost. The accuracy of galaxies is still worse than that of quasars and stars. Nevertheless, all metrics are better than 76.1 per cent. The total accuracy of random forest is 89.42 per cent.

For the sample from the X-ray, optical, and infrared bands, the classification performance of random forest and LogitBoost is listed in Table 8. The input pattern is log(fx), hr1, hr2, hr3, hr4, extent, r, W1, u − g, g − r, r − i, i − z, z − W1, W1 − W2, log(fx/fr). As shown in Table 8, even for galaxies, the metrics are greater than 87.1 per cent; for stars, the metrics are above 90.7 per cent; for quasars, the metrics are higher than 95.4 per cent. All metrics except precision for LogitBoost are greater than those of random forest. Compared to random forest, LogitBoost has a slight advantage and its total accuracy reaches 94.26 per cent.

In order to check how the observational errors influence the performance of a classifier, we take the XMM–SDSS sample as an example. Setting σ_u < 0.3, σ_g < 0.3, σ_r < 0.3, σ_i < 0.3, and σ_z < 0.3, the known sample size changes from 31 800 to 26 428; the performance of random forest and LogitBoost is shown in Table 9. Comparing the result in Table 9 with that in Table 6, the performances of random forest and LogitBoost both improve with higher-quality data (94.73 per cent versus 92.57 per cent for random forest, 94.93 per cent versus 92.82 per cent for LogitBoost) in terms of accuracy. Although higher-quality data lead to higher performance of a classifier, the number of sources with X-ray emission is inherently small, so we do not set a magnitude error limitation on the samples in our work.

Method    Random forest    Rotation forest

Table 6. The performance of random forest and LogitBoost with log(fx), hr1, hr2, hr3, hr4, extent, r, u − g, g − r, r − i, i − z, log(fx/fr).

Method    Random forest                      LogitBoost
Class     Precision  Recall  F-measure       Precision  Recall  F-measure

Table 7. The performance of random forest and LogitBoost with log(fx), hr1, hr2, hr3, hr4, extent, W1, W1 − W2.

Method    Random forest                      LogitBoost
Class     Precision  Recall  F-measure       Precision  Recall  F-measure

Table 8. The performance of random forest and LogitBoost with log(fx), hr1, hr2, hr3, hr4, extent, r, W1, u − g, g − r, r − i, i − z, z − W1, W1 − W2, log(fx/fr).

Method    Random forest                      LogitBoost
Class     Precision  Recall  F-measure       Precision  Recall  F-measure

Table 9. The performance of random forest and LogitBoost with log(fx), hr1, hr2, hr3, hr4, extent, r, u − g, g − r, r − i, i − z, log(fx/fr) when σ_u < 0.3, σ_g < 0.3, σ_r < 0.3, σ_i < 0.3, and σ_z < 0.3.

Method    Random forest                      LogitBoost
Class     Precision  Recall  F-measure       Precision  Recall  F-measure

5 DISCUSSION AND APPLICATION

Comparing Tables 5–8, the worst result belongs to the sample from the X-ray band only, as expected. When information from the optical and/or infrared bands is added, the classification accuracy increases for any classifier, and the accuracy with the X-ray and optical bands is better than that with the X-ray and infrared bands. The best performance is obtained with all information from the X-ray, optical, and infrared bands. There is no algorithm that shows the best performance for every data set. For the sample from the X-ray band, the rotation-forest classifier is the best; for the sample from the X-ray and infrared bands, random forest is superior to all other algorithms; and for the other two samples, LogitBoost performs best.

In reality, some X-ray sources have information from the X-ray, optical, and infrared bands, some have information from the X-ray and infrared bands, some have information from the X-ray and optical bands, and some even have X-ray information only. Based on the known samples with spectral classes, we need to construct four classifiers for the four situations to predict the unknown X-ray sources. For the sources with X-ray information only, a rotation-forest classifier is built with the known samples with spectral classes to predict their classes and probability. For the sources with X-ray and infrared bands, a random-forest classifier is created with the known samples with spectral classes to predict their classes and probability. For the sources from the X-ray and optical bands or from the X-ray, optical, and infrared bands, LogitBoost classifiers are constructed with the corresponding known samples with spectral classes to predict their classes and probability, respectively. All predicted results for the 4XMM-DR9 sources are given in Table 10. The information gained will be of great value for further research into the characteristics and physics of X-ray sources.

6 CONCLUSIONS

Based on the distribution of stars, galaxies, and quasars in 2D space, it is difficult to discriminate them and their subclasses clearly. Similarly, given the distribution of all spectral classes of stars in 2D space, it is also not so easy to separate them, but CV stars and M stars stand out clearly in some 2D spaces. Of the entire X-ray sample, quasars make up the majority while stars and galaxies form only a minority. With X-ray information and spectral classes of known X-ray sources, we create a rotation-forest classifier to assign classification results and their probabilities for all 4XMM-DR9 sources. Based on information from the X-ray and infrared bands as well as spectral classes of known X-ray sources, a random-forest classifier is used to discriminate X-ray sources. By means of properties from the X-ray, optical, and/or infrared bands and spectral classes of known X-ray sources, we build LogitBoost classifiers to predict X-ray sources. The predicted results from different methods with different input properties are listed in full in a table, which may be used for further study of the X-ray properties of various kinds of objects in detail.
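The choice of classifier per source, as summarized above, amounts to a lookup on which bands are available. The following is a minimal sketch of that lookup, not the authors' code: the function name `select_model` and its boolean arguments are hypothetical, and the suffixes x, xi, xo, and xio are assumed to correspond to the X-ray only, X-ray plus infrared, X-ray plus optical, and X-ray plus optical plus infrared cases in the Table 10 column names.

```python
# Best algorithm per band combination, as reported in the text.
# Keys are (has_optical, has_infrared); X-ray data are always present.
# The suffixes are an assumed reading of the Table 10 column names.
MODEL_BY_BANDS = {
    (False, False): ("x", "rotation forest"),
    (False, True): ("xi", "random forest"),
    (True, False): ("xo", "LogitBoost"),
    (True, True): ("xio", "LogitBoost"),
}


def select_model(has_optical, has_infrared):
    """Return (column suffix, algorithm name) for an X-ray source.

    Hypothetical helper: the flags say whether optical (SDSS) and
    infrared (AllWISE) counterparts exist for the source.
    """
    return MODEL_BY_BANDS[(bool(has_optical), bool(has_infrared))]


# Example: a source with an AllWISE counterpart but no SDSS photometry.
suffix, algorithm = select_model(has_optical=False, has_infrared=True)
```

A source with counterparts in more than one catalogue can also be passed through each applicable model in turn, which would be consistent with Table 10 carrying a separate class and probability column for every band combination.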
srcid  sc_ra  sc_dec  Class x  Px  Class xo  Pxo  Class xi  Pxi  Class xio  Pxio