Machine Learning and Applications: An International Journal (MLAIJ) Vol.6, No.1, March 2019
Predicting Forced Population Displacement
Using News Articles
Sadra Abrishamkar and Forouq Khonsari
Department of Electrical Engineering and Computer Science
York University, Toronto, Canada
Abstract. The world has witnessed mass forced population displacement across the globe. Population displacement has various indications, with different social and policy consequences. Mitigating the resulting humanitarian crises requires tracking and predicting population movements in order to allocate the necessary resources and inform policymakers. The events that trigger population movements can be traced in news articles. In this paper, we propose the Population Displacement-Signal Extraction Framework (PD-SEF) to explore a large news corpus and extract the signals of forced population displacement. PD-SEF measures and evaluates violence signals, a critical factor in forced displacement. Following signal extraction, we propose a displacement prediction model based on the extracted violence scores. Experimental results indicate the effectiveness of our framework in extracting high-quality violence scores and building accurate prediction models.
Keywords: Topic Modeling, Classification, Humanitarian Signal Extraction
1 Introduction
A state of forced migration exists when a significant number of people are displaced due to socio-political issues, armed conflicts, or human-made or natural disasters. This research focuses on the role of violence and security threats in forced migration. Tracking and predicting population displacement is a critical step in developing an early warning system for humanitarian crises. Furthermore, the development of events in conflict regions influences the decision to move, so detecting displacement signals is essential for studying forced migration. News articles are a powerful source for capturing these events.
However, building a quantitative model that extracts and measures forced migration signals (particularly violence in this research) and finds the relationship between violence and forced displacement remains relatively unexplored. To address this gap, we introduce the PD-SEF framework¹ to detect signals of forced population displacement by analyzing news articles. We focus mainly on Iraq and Syria as case studies, and our experiments are centered on detecting violence as the most critical indicator of forced migration.
¹ Code for this project is available at https://github.com/YorkUIRLab/eosdb
DOI:10.5121/mlaij.2019.6101
The main contribution of this paper is a comprehensive framework for analyzing a large news corpus and extracting the signals that trigger forced population displacement. Furthermore, a displacement prediction model is devised that takes these signals into account and estimates future population movements.
The experimental results show that our extracted violence scores improve on the previous state-of-the-art method and validate our assumption that violence is an effective feature for building prediction models for forced migration.
2 Related Work
2.1 Detecting Factors of Forced Migration
In 1997, Schmeidl [12] developed a theoretical model of refugee migration based on factors with estimated magnitudes. These factors include economic underdevelopment, human rights violations, and ethnic and civil conflicts. The factors were then included in a pooled time-series analysis to predict the number of refugees. That research showed that economic and intervening policy variables are less useful for predicting refugee migration than the threat of violence. It differs from our model in the methodology used for extracting the forced migration signals: whereas Schmeidl uses scores generated manually from various resources for the factors mentioned above, we automatically extract the scores of the factors of forced migration from news articles.
To analyze the trends of a particular topic over time, Wang and McCallum introduced Topics over Time (TOT), an LDA-based topic model [14]. To achieve this, they jointly model word co-occurrences and their localization in continuous time. Furthermore, time-sensitive document topic modeling was proposed in [6], where the policies of the European Parliament (EP) are studied through dynamic topic modeling of the documents generated by the parliament (e.g. debates and bills). The political agenda of the EP has evolved significantly over time as new events and topics unfold.
The research most similar to ours was performed by Agrawal et al. [1]. They introduced a method to extract the magnitude of violence from news articles. The method embeds the words of news articles using word embedding techniques and then uses similarity measures within the embedding space to compute the similarity between the words of a document and a set of predefined seed words indicating violence. Finally, a correlation was observed between the extracted violence scores and the number of migrated people. That work only detects the magnitude of violence from news articles and does not address other factors of forced migration. In addition, the quality of the extracted violence scores depends considerably on the quality of the manually generated set of seed words.
2.2 Topic Modeling
Topic modeling is an effective method for processing large numbers of documents. It allows the discovery of a distinct set of topics in a corpus by connecting words with similar meanings to form topics.
LDA (Latent Dirichlet Allocation) LDA belongs to the group of generative probabilistic models for documents. It is based on the intuitive idea that each document is a mixture of topics, and each topic is a discrete probability distribution describing how likely each word is to appear in that topic. Given a set of documents W = {w1, w2, ..., wd}, to generate a word token wn in document d, a discrete topic assignment zn is drawn from a document-specific distribution θd over the T topics, and θd is itself drawn from a Dirichlet prior with hyper-parameter α. The inference task in topic models is defined as inferring the document proportions {θ1, ..., θD} and the topic-specific distributions {φ1, ..., φT} [9]. Building on the foundations of LDA, [7] proposed placing a Dirichlet prior on the topic-word distribution φ, with an additional hyper-parameter β. Further, [7] used collapsed Gibbs sampling to estimate the topic space distribution indirectly. This method iteratively estimates the probability of assigning each word to each topic, conditioned on the current topic assignments of all other words. The popular topic modeling toolkit MALLET is based on LDA with Gibbs sampling.
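The generative story described above can be made concrete with a small sketch. It only simulates how LDA assumes a document is produced (it does not perform inference), and the toy sizes, the hyper-parameter values and all function names are illustrative assumptions rather than settings taken from this paper.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(n_words, phi, alpha, rng):
    """Generate one document following the LDA generative process."""
    theta_d = sample_dirichlet(alpha, rng)        # document-specific topic mixture
    assignments, tokens = [], []
    for _ in range(n_words):
        z_n = sample_categorical(theta_d, rng)    # topic assignment of this token
        w_n = sample_categorical(phi[z_n], rng)   # word id drawn from topic z_n
        assignments.append(z_n)
        tokens.append(w_n)
    return theta_d, assignments, tokens

rng = random.Random(0)
T, V = 3, 8                 # toy sizes (assumed): 3 topics, 8-word vocabulary
alpha = [0.5] * T           # symmetric Dirichlet prior on theta_d (assumed value)
beta = [0.5] * V            # symmetric Dirichlet prior on phi_t (assumed value)
phi = [sample_dirichlet(beta, rng) for _ in range(T)]
theta_d, z, w = generate_document(20, phi, alpha, rng)
```

Inference, whether by collapsed Gibbs sampling or another method, runs this story in reverse: it observes only the tokens and estimates θ and φ.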
3 Our Proposed Framework
The PD-SEF framework consists of three components. The first component includes
analyzing a large corpus of news articles using topic modeling techniques in order
to provide an efficient and compact representation. The dimensionality reduction
achieved by topic modeling is an important benefit for the second component where
human knowledge is needed to analyze the extracted topics. The second component
uses the extracted topics to output a degree of violence for each month, which is
then used in the third component to build prediction models for forecasting the
future number of refugees from Syria and Iraq.
3.1 Topic Modeling
Non-negative Matrix Factorization (NMF) Matrix factorization is a widely used approach for the analysis of high-dimensional data. The objective of NMF is to extract meaningful features from a set of non-negative sparse vectors. NMF has been successfully applied in different applications, such as image processing, hyper-spectral imaging, and text mining. Here we focus on the ability of NMF to identify topics in a given set of documents and to classify the documents among the underlying topics.
Fig. 1. NMF factorization
Let each column of the non-negative real-valued data matrix X correspond to a document and each row to a word in the dictionary. The (i, j)th entry of X is the number of times the ith word appears in the jth document, so the jth column of X is the word-count vector of the jth document. In practice, the bag-of-words counts are replaced by the term frequency - inverse document frequency (TF-IDF) representation of the document. TF-IDF pre-processing has been shown to improve NMF results by down-weighting the contribution of high-frequency terms while boosting the effect of rarer terms in the corpus [5, 6]. The matrix X is rather sparse, since most documents contain only a small subset of the dictionary terms.
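As a minimal sketch of how such a matrix could be assembled, the following builds a word-by-document TF-IDF matrix for a toy corpus. The whitespace tokenizer, the particular IDF variant (log(N/df) + 1) and the example documents are assumptions made for illustration; the paper does not state which TF-IDF weighting was used.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, X) where X[i][j] is the TF-IDF weight of word i in document j."""
    tokenized = [doc.lower().split() for doc in docs]   # naive tokenizer (assumption)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents in which each word occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    X = [[0.0] * n_docs for _ in vocab]                 # rows = words, columns = documents
    for j, toks in enumerate(tokenized):
        counts = Counter(toks)
        for i, word in enumerate(vocab):
            if counts[word]:
                idf = math.log(n_docs / df[word]) + 1.0  # assumed IDF variant
                X[i][j] = counts[word] * idf
    return vocab, X

docs = [
    "bomb attack baghdad",
    "refugee aid jordan",
    "attack police baghdad attack",
]
vocab, X = tfidf_matrix(docs)
```

Most entries of X are zero even in this tiny example, which mirrors the sparsity noted above.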
Given matrix X and factorization rank r, NMF decomposition generates two
non-negative factors W and H, where X ≈ W H (Eq. 1) [8].
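\[
\underbrace{X(:, j)}_{j\text{th document}}
\approx \sum_{k=1}^{r}
\underbrace{W(:, k)}_{k\text{th topic}}\;
\underbrace{H(k, j)}_{\text{weight of } k\text{th topic in } j\text{th document}},
\qquad \text{with } W \ge 0 \text{ and } H \ge 0. \tag{1}
\]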
The result of the NMF decomposition can be interpreted by examining the matrices W and H. Because W is non-negative, each column of W can be interpreted as a bag-of-words representation of a topic. At the same time, since the weights in the linear combinations are also non-negative (H ≥ 0), each document is approximated by an additive combination of the word sets in W. Since the number of documents (columns of X) is much larger than the number of basis elements (columns of W), the columns of W capture sets of words that appear together in multiple documents. The weights in the linear combinations, given by matrix H, assign the documents to the different topics. Therefore, NMF can identify the topics across the documents and classify each document according to the topics it belongs to. Inspired by the work in [6] on extracting dynamic topics, we apply NMF to time windows of the news media corpus to extract the evolving topics.
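To illustrate the factorization X ≈ WH, here is a small pure-Python sketch using the multiplicative update rules of Lee and Seung for the Frobenius-norm objective. The paper does not say which NMF solver was used, so the choice of updates, the random initialization and the toy matrix are all assumptions.

```python
import random

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(X, W, H):
    """Frobenius norm of X - WH."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry)) ** 0.5

def nmf(X, r, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative X (words x documents) into W (words x r) and H (r x documents)."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    W = [[rng.random() + eps for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + eps for _ in range(n)] for _ in range(r)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

# Toy matrix with two disjoint word groups (rows) used by two document groups (columns).
X_toy = [
    [2.0, 2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 1.0, 1.0],
]
W, H = nmf(X_toy, r=2)
```

Each column of W can then be read as a topic (its largest entries are the topic's top words), and column j of H gives the topic weights of document j.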
Topic model generation: The first step in topic modeling is to divide the news corpus into time windows. In this step, the news articles are processed and labeled according to their timestamps. The length of a time window can be a day, a week or a month, and it affects the granularity of the generated topics. In this framework, we chose monthly time windows for the topic analysis.
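A minimal sketch of the monthly binning step follows. The input format (pairs of publication date and text) and the function name are hypothetical; the actual EOS record structure is not described in this paper.

```python
from collections import defaultdict
from datetime import date

def monthly_windows(articles):
    """Group (published_date, text) pairs into non-overlapping monthly bins."""
    bins = defaultdict(list)
    for published, text in articles:
        bins[(published.year, published.month)].append(text)
    # Return the windows in chronological order, keyed by (year, month).
    return dict(sorted(bins.items()))

articles = [
    (date(2012, 2, 3), "article c"),
    (date(2012, 1, 15), "article a"),
    (date(2012, 1, 28), "article b"),
]
windows = monthly_windows(articles)
```

Switching to daily or weekly windows would only require changing the key computed for each article.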
After generating the time windows, we generate topic models for each time window. All documents in the time window are processed and passed to the topic modeling algorithm. Topic modeling is an unsupervised learning task and requires parameter tuning. The most critical parameter is the number of topics k: topic models are sensitive to it, and the quality of the generated topics varies drastically if k is not selected correctly. To identify the optimal k, we generate topic models with the number of topics ranging from 10 to 50 in increments of two, that is, k ∈ {10, 12, ..., 50}.
Topic model coherence analysis: To find the optimal number of topics, we apply two different topic coherence measures and choose the best result. Quantifying the coherence of the topics extracted by topic modeling methods is a crucial part of this research. The primary concern about the usefulness of statistical topic modeling is the presence of weak and obscure topics in the results. Topics with mixed and loosely related concepts usually fail to generate meaningful insight for users. As Mimno et al. [9] observed, there is a strong relationship between the number of topics and the quality of the topics as judged by domain experts: as the number of topics increases, the quality of the word distributions constituting the topics decreases. At the same time, too few topics may result in an undesirable generalization of the topic distribution in the corpus.
Recent work on the evaluation of statistical topic models has focused on the semantic coherence of topics [10, 11]. To calculate the topic coherence score, we use TC-W2V [10] and the Unify framework [11]. We choose the topic model with the highest coherence to generate the window topic documents.
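The selection of k by coherence can be sketched as follows. The score used here, the mean pairwise cosine similarity between the embeddings of a topic's top words, is in the spirit of an embedding-based coherence measure such as TC-W2V, but it is a simplified stand-in and not the exact measure of [10] or [11]. The two-dimensional embeddings and the candidate models are toy values invented for illustration.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def topic_coherence(top_words, embeddings):
    """Mean pairwise cosine similarity between the top words of one topic."""
    pairs = list(combinations(top_words, 2))
    return sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)

def model_coherence(topics, embeddings):
    """Average topic coherence over all topics of one model."""
    return sum(topic_coherence(t, embeddings) for t in topics) / len(topics)

def select_best_k(models_by_k, embeddings):
    """models_by_k maps k to a list of topics, each a list of top words."""
    return max(models_by_k, key=lambda k: model_coherence(models_by_k[k], embeddings))

# Toy word vectors (hypothetical); real ones would come from a trained embedding model.
embeddings = {
    "bomb": [1.0, 0.0], "attack": [0.9, 0.1],
    "oil": [0.0, 1.0], "price": [0.1, 0.9],
}
models_by_k = {
    1: [["bomb", "attack", "oil", "price"]],     # one mixed topic
    2: [["bomb", "attack"], ["oil", "price"]],   # two focused topics
}
best_k = select_best_k(models_by_k, embeddings)
```

In the framework, this selection would be repeated for every monthly window over the whole range of candidate k values.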
Fig. 2. Topic generation process
At this step, we have generated the topic models for each monthly time window and evaluated their coherence (Fig 2). A topic model gives, for each topic, a probability distribution over the words that appear in it. For each topic, we generate a topic-document to represent it. A topic-document is generated by ranking the keywords of the topic by their probability of appearing in the topic and keeping the top ones. The number of topics per time window may vary, since for each window we keep only the model with the best coherence score over the range of k.
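Building a topic-document amounts to a ranking step, sketched below. The word weights are made-up values, and the function name and the default cut-off are assumptions chosen for the example.

```python
def topic_document(word_weights, top_n=10):
    """Return the top_n words of a topic, ranked by their weight in the topic."""
    ranked = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:top_n]]

# Hypothetical weights of one topic (e.g. one column of W, or one LDA topic).
weights = {"killed": 0.30, "car": 0.10, "attack": 0.20, "market": 0.01}
doc = topic_document(weights, top_n=3)
```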
3.2 Violence Scores Extraction
The topic-documents generated by the first component form a new corpus with reduced dimensionality. Our next step is to label each topic-document with a category such as violence, relief or economic issues. This step can be done automatically using a topic labeling method, but we did it manually with the help of a social scientist to ensure the quality of the topic labels, because the quality of the labeled data directly affects the quality of the forced displacement prediction model. The result of this process is a set of labeled monthly topic-documents.
The violence score for each month is then defined as the number of violence topics in that month divided by the total number of topics in that month. Because we want to model population movements, we need to measure the impact of violence on people, which causes the movement. The degree of violence depends to some extent on the scale of other events happening at the same time. We therefore normalize by the total number of topics to obtain a relative measure of the impact of violence on people.
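The definition above translates directly into a few lines of code. The label strings and the month keys are illustrative assumptions; only the ratio itself comes from the text.

```python
def violence_scores(labels_by_month, target="violence/terrorism"):
    """Fraction of a month's topics that carry the violence label."""
    scores = {}
    for month, labels in labels_by_month.items():
        n_violence = sum(1 for label in labels if label == target)
        scores[month] = n_violence / len(labels) if labels else 0.0
    return scores

# Hypothetical labeled topic-documents for two months.
labels_by_month = {
    "2015-01": ["violence/terrorism", "violence/terrorism", "relief", "economic issues"],
    "2015-02": ["political issues", "violence/terrorism", "refugee crisis"],
}
scores = violence_scores(labels_by_month)
```

Because the number of topics can differ between months, the ratio keeps the monthly scores comparable.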
3.3 Topic Classification for Automatic Labeling
We used the topic modeling results as a way to reduce the dimensionality of the entire news media corpus. The topics, we conjecture, represent the gist of the overall hidden themes in this setting. Classifying news articles based on the extracted topics is useful for labeling incoming news media articles. The effort to sort and classify the violence signals from news media has a temporal aspect. We used different classification techniques to classify the extracted topics according to their relevance to the factors of forced migration, specifically violence in this research. We used the topic-documents (the top keywords of each topic according to its probability distribution) as documents, and used the topic labels (described in Section 4.2) to train the supervised classifiers.
For the classification task, we examined a support vector machine classifier with a linear kernel (SVM-Linear) [4], XGBoost [3], Logistic Regression, and Stochastic Gradient Descent.
3.4 Building Prediction Models
A three-step procedure is performed to analyze the extracted violence scores and
to build prediction models for forced displacement:
Step 1: The extracted violence scores are used as the input features of regression models to predict the future number of refugees. This step investigates whether the violence scores alone are sufficient to make predictions.
Step 2: To compare the extracted violence scores against a baseline feature, we build an auto-regression model on the time series of refugee population displacement. Auto-regression is a regression model that takes lag variables as input features, based on the assumption that previous values (the lag variables) might affect future values. The auto-regression model built in this step uses the previous numbers of refugees as input features and makes predictions without relying on any external source. A lag variable of 1 means that the previous observation in the time series (t-1) is used to predict the future observation (t); a lag variable of 2 means that (t-2) is used to predict (t), and so on. This step investigates the validity of the assumption that previous movements affect future displacements and provides a baseline feature against which the quality of the violence scores can be compared.
Fig. 3. EOS data set statistics.
Step 3: The two previous steps are combined by using both violence scores and lag variables as inputs to the predictive regression model. This step investigates the effectiveness of violence scores in improving the accuracy of the prediction model built in step 2.
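The three steps differ only in which input features are passed to the regression model, which the following sketch makes explicit. The function name, its parameters and the toy series are assumptions; the sketch only builds the feature matrix and targets, leaving the choice of regression model open.

```python
def build_features(refugees, violence, n_lags=1, horizon=1,
                   use_lags=True, use_violence=True):
    """Build (X, y): features observed up to month t, target at month t + horizon."""
    X, y = [], []
    for t in range(n_lags - 1, len(refugees) - horizon):
        row = []
        if use_lags:
            # Lag variables: refugee counts at t, t-1, ..., t-n_lags+1.
            row += [refugees[t - i] for i in range(n_lags)]
        if use_violence:
            row.append(violence[t])   # violence score observed at month t
        X.append(row)
        y.append(refugees[t + horizon])
    return X, y

refugees = [10, 20, 30, 40, 50]         # toy monthly refugee counts
violence = [0.1, 0.2, 0.3, 0.4, 0.5]    # toy monthly violence scores

X1, y1 = build_features(refugees, violence, n_lags=2, use_lags=False)      # step 1
X2, y2 = build_features(refugees, violence, n_lags=2, use_violence=False)  # step 2
X3, y3 = build_features(refugees, violence, n_lags=2)                      # step 3
```

Setting horizon to 2 or 3 yields the training pairs for the t+2 and t+3 prediction settings.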
4 Experimental Settings
4.1 Datasets
EOS dataset: The Expanded Open Source (EOS)² dataset, maintained by Georgetown University researchers, is a vast unstructured archive of over 700 million news articles. The news articles used in this research are filtered by relevance to Iraq and Syria using the EOS search engine. This initial filtering yields a smaller, more focused news set. The corpus consists of 680,456 news articles spanning January 2012 to June 2017. Figure 3 describes the document length characteristics of the data used for this experiment.
UNHCR dataset: We used the UNHCR³ refugee population statistics dataset to capture the number of refugees on a monthly basis. For building prediction models, we focus on the subset of the UNHCR dataset related to Iraqi and Syrian refugees from 2012 to 2017, as the EOS dataset provides news articles for this period. Because the distributions of Iraqi and Syrian refugees are similar over this period and we do not have separate violence scores for Iraq and Syria, we build a new time series by averaging the numbers of Iraqi and Syrian refugees and use it for all the following experiments. The new time series has a mean of 11,080.05 and a standard deviation of 11,206.95, with a minimum of 0 and a maximum of 48,753.
² https://osvpr.georgetown.edu/eos
³ United Nations High Commissioner for Refugees
4.2 Time Window Collection & Topic Labeling
To analyze the evolution of topics over time in the large document collection, we first need to separate the documents into time-stamped bins. Following previous work [13], we divide the EOS data into a set of sequential non-overlapping time windows {T1, ..., Ti}. Each time window bin consists of documents ordered by publication date. The bins are non-overlapping and divided on a monthly basis.
The monthly time window bins ensure that each bin contains enough documents. The reason for creating time window bins is two-fold: first, we are interested in the topical events in shorter window sizes; second, short-lived topics may be obscured by the generalized topics learned from the entire collection. The monthly time window also allows us to identify more granular and short-term topics, as well as generalized topics over longer terms.
We manually labeled the monthly extracted topics into seven categories: violence/terrorism, economic issues, environmental issues, political issues, religious conflicts, refugee crisis, and relief.
Classification Baselines We compare the performance of our topic classifiers with various traditional baseline methods, including linear methods, SVM and regularized linear models with stochastic gradient descent (SGD) [2].
4.3 Prediction Models
We evaluate our prediction models in three settings: predicting t+1, t+2 and t+3. The Root Mean Square Error (RMSE) is reported separately for each time step, and the prediction error is calculated on the test dataset. We use walk-forward evaluation: each time step in the test set is given to the model one at a time; the model predicts a value for that time step, and the actual value then becomes available to the model for making the next predictions. We do this because our test set has more than 10 instances, and it is not possible to predict all of them accurately using only the information available at the present time. Walk-forward evaluation mimics a real-world scenario in which we make predictions for the next 3 months (t+1, t+2, t+3), the information for the next month is then released (in our case, news articles and the number of refugees), and this information is used to make predictions for the upcoming months.
The training set includes 80% of the data (Jan 2012-Mar 2016) and the test set includes the last 20% of the observations in the UNHCR dataset (May 2016-Apr 2017), which amounts to 12 instances. A total of 36 forecasts are performed (12 forecasts for each setting of t+1, t+2, t+3).
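The walk-forward procedure can be sketched as a loop over the test months in which each forecast sees only the observations released at least `horizon` months before its target. The persistence forecaster used here is a hypothetical placeholder for the regression models of Section 3.4, and the function names and the toy series are assumptions.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def walk_forward(series, n_train, horizon, fit_predict):
    """Forecast every test point using only data available `horizon` steps before it."""
    actual, predicted = [], []
    for target in range(n_train, len(series)):
        history = series[: target - horizon + 1]   # observations released so far
        predicted.append(fit_predict(history, horizon))
        actual.append(series[target])
    return rmse(actual, predicted)

def persistence(history, horizon):
    """Placeholder model (assumption): repeat the last observed value."""
    return history[-1]

series = [1, 2, 3, 4, 5, 6]   # toy time series
errors = {h: walk_forward(series, n_train=3, horizon=h, fit_predict=persistence)
          for h in (1, 2, 3)}
```

With 12 test months and three horizons, this loop produces the 36 forecasts mentioned above, and one RMSE value per horizon.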
Computation For these experiments, we used a machine with an Intel Core i7 processor and 32 GB of RAM. Processing the news text and generating the time-window topics took almost 9 hours on this machine.
5 Results and Discussion
5.1 Prediction Models Evaluation
Table 1 reports the RMSE of several prediction models for the different prediction settings (t+1, t+2, t+3) and steps (with different input features). For Multi-layer Perceptron regression (MLP), Random Forest, Linear Regression, Support-Vector Regression (SVR) and Stochastic Gradient Descent (SGD), grid search is used to fine-tune the model parameters, and the best results are reported. We also tuned our neural-network-based models with different optimizers (Adam, RMSProp). The three neural-network-based models are:
LSTM and GRU models: 1 input, a hidden layer with 4 LSTM or GRU blocks, and an output layer that makes a single-value prediction.
LSTM2LSTM: 1 input layer feeds into an LSTM layer with an internal state of size 50 and a dropout rate of 0.2, followed by another LSTM layer with an internal state of size 100, which then feeds into a fully connected layer of 1 neuron with a linear activation function for prediction.
Considering that the UNHCR time series has a mean of 11,080 and a standard deviation of 11,206, the errors of the prediction models are within the tolerance range. The lowest RMSE for each setting is shown in bold in the table. Intuitively, predicting t+3 is harder than t+2, which is harder than t+1, because more of the intervening data is missing. However, the improvement of step 3 over step 2 is most tangible at t+2. This is consistent with the 2-month gap explained in Section 5.2. The results show that violence scores alone are not sufficient to make predictions; however, for 7 out of 8 models, the average error decreased when moving from step 2 to step 3, indicating that lag variables and violence scores together improve the models' performance. The impact of previous movements on future movements might be due to the effect of the surrounding environment on people.
The best performance on average is achieved by MLP regression (Fig 4). We believe this success is due to its ability to capture historical observations through the input window (unlike SGD, SVR and linear regression). Furthermore, it is less complex than LSTM and GRU, making it more suitable for our small dataset.
Fig. 4. Prediction model performance
5.2 Violence Scores Evaluation
To evaluate the PD-SEF violence scores and compare them with the violence scores extracted from EOS articles using the method introduced in [1], we compare both sets of scores against the UNHCR dataset and compute the Pearson correlation between the violence scores and the number of refugees. The PD-SEF violence scores showed a 22% higher correlation with refugee displacement than the violence scores of [1].
Figure 5 shows the PD-SEF violence scores against the average refugee population of Iraq and Syria. PD-SEF captures the high level of violence during 2015, which may have caused the mass population displacement during that year. Furthermore, there is a two-month gap between the peak of PD-SEF violence and the peak of population displacement. Removing the gap increases the Pearson correlation from 0.483 to 0.544, supporting the assumption that two months is a reasonable amount of time between when the triggers of forced migration are observed and when the displacement actually takes place for a large number of people.
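"Removing the gap" can be read as shifting one series against the other before computing the correlation. The sketch below implements that reading with a hand-written Pearson coefficient; the function names and the toy series (built so that displacement follows violence by exactly two months) are assumptions and do not reproduce the paper's data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def lagged_pearson(violence, refugees, lag):
    """Correlate violence at month t with refugee counts at month t + lag."""
    if lag == 0:
        return pearson(violence, refugees)
    return pearson(violence[:-lag], refugees[lag:])

# Toy series: refugees[t + 2] equals violence[t], i.e. a two-month delay.
violence = [0, 1, 4, 2, 0, 0, 3, 1]
refugees = [5, 5, 0, 1, 4, 2, 0, 0]
same_month = lagged_pearson(violence, refugees, lag=0)
two_month_shift = lagged_pearson(violence, refugees, lag=2)
```

Scanning a small range of lags and keeping the one with the highest correlation is one simple way to estimate such a delay from data.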
Table 1. RMSE of forced displacement prediction models. An X marks the input features used by each row; the arrow in the AVG column indicates whether the average error with both features is lower (↓) or higher (↑) than with the lag variable alone.

Regression Model    Lag Variable  Violence Scores  t+1        t+2       t+3       AVG (t+1,t+2,t+3)
LSTM                X                              3,266.05   4,767.52  6,239.41  4,757.64
                                  X                9,708.36   8,959.28  8,250.36  8,972.66
                    X             X                3,546.03   4,408.93  4,932.07  4,295.67 ↓
LSTM2LSTM           X                              3,266.06   5,151.28  4,094.39  4,170.57
                                  X                9,563.09   8,786.27  8,077.21  8,808.85
                    X             X                3,225.44   4,772.04  4,099.56  4,032.34 ↓
GRU                 X                              3,955.5    4,674.38  4,339.82  4,323.23
                                  X                9,706.84   8,960.13  8,251.46  8,972.81
                    X             X                3,865.65   4,467.03  4,283.05  4,205.24 ↓
MLP                 X                              4,663.21   3,599.11  3,996.31  4,086.21
                                  X                5,129.4    4,875.94  4,809.29  4,938.21
                    X             X                3,664.41   3,343.65  3,995.79  3,667.95 ↓
Random Forest       X                              7,987.04   8,804.43  9,477.96  8,756.47
                                  X                11,133.63  7,721.94  7,137.77  8,664.44
                    X             X                6,902.37   7,622.25  9,505.81  8,010.14 ↓
Linear Regression   X                              3,550.41   3,645.06  4,033.56  3,743.01
                                  X                6,734.52   7,122.58  7,297.1   7,051.4
                    X             X                3,658.19   3,506.19  3,997.17  3,720.51 ↓
SVR                 X                              5,074.14   5,243.61  5,391.58  5,236.44
                                  X                6,652.57   5,972.41  5,324.02  5,983.0
                    X             X                4,898.5    4,928.62  4,729.99  4,852.37 ↓
SGD Regression      X                              3,526.84   4,387.69  5,239.25  4,330.53
                                  X                6,655.03   7,060.31  6,939.52  6,884.95
                    X             X                3,956.09   4,833.23  4,632.28  4,473.86 ↑

5.3 Classification Evaluation
Different classification methods are used to classify the topics, including a Support Vector Classifier, XGBoost, Logistic Regression, and Stochastic Gradient Descent. We used 10-fold cross-validation to measure the performance of the classifiers. Table 2 reports the best performance of these classification methods in terms of precision, recall and F1 score. The parameter values that gave the best performance for each method are listed below.
– SVC: linear kernel
– SGD: hinge loss function with l2 regularization
– Logistic Regression: l2 regularization
With the choice of the best features, classification performance improved considerably. Experiments reveal that the SGD classifier performs best compared with the other classifiers. We represent each topic-document by its top 50 words, ranked by their probability of being observed in the topic. These results show that the features we extracted from the unlabeled news media through topic modeling are suitable for representing the overall latent space of the corpus. Further, we rely on the classification task to provide automatic labels for the population displacement signals from the news media.
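For reference, precision, recall and F1 for a multi-class labeling task can be computed as in the sketch below. It uses macro-averaging over classes, which is an assumption: the paper does not state which averaging scheme underlies the reported scores. The label strings and predictions are toy values.

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over all labels seen in the data."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy gold labels and predictions for four topic-documents.
y_true = ["violence", "violence", "relief", "relief"]
y_pred = ["violence", "relief", "relief", "relief"]
precision, recall, f1 = macro_prf(y_true, y_pred)
```

Under 10-fold cross-validation, these scores would be computed on each held-out fold and then averaged.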
Fig. 5. Scaled violence scores vs. average displacement. [Plot: PD-SEF violence scores and Iraq/Syria average refugees, scaled to 0.0-1.0, over 2012-2017.]

Table 2. Topic classification results

Classifier                      Precision  Recall  F1 Score
SVM (linear) + BOW              0.819      0.818   0.818
SVM (linear) + TF-IDF           0.847      0.844   0.844
SVM (linear) + GloVe            0.839      0.838   0.837
XGBoost Classifier + BOW        0.825      0.825   0.825
XGBoost Classifier + TF-IDF     0.812      0.812   0.811
XGBoost Classifier + GloVe      0.840      0.838   0.838
Logistic Regression + BOW       0.827      0.826   0.826
Logistic Regression + TF-IDF    0.840      0.838   0.838
Logistic Regression + GloVe     0.841      0.840   0.840
SGD Classifier + BOW            0.810      0.809   0.809
SGD Classifier + TF-IDF         0.848      0.847   0.847
SGD Classifier + GloVe          0.844      0.843   0.843

5.4 Topic Modeling Evaluation
First, we choose a range of topic numbers for the topic models. In this case, we let k range from 10 to 30 in increments of two, that is, k ∈ {10, 12, ..., 30}. Next, Latent Dirichlet Allocation (LDA) Mallet⁴ (Gibbs sampling) and Non-negative Matrix Factorization (NMF) topic models are generated using the window slice of the corpus. LDA Mallet is reported to be a strong baseline for analyzing the performance of topic models [6]. The window topic space tends to be sensitive to local, short-lived bursts of topics in the large dataset, since the models are trained on a subset of the entire corpus. Table 3 shows the extracted topics and their corresponding labels.
⁴ http://mallet.cs.umass.edu/
Table 3. Topics and keyword representation.

Topic label          Top ten words (sorted by the probability of the word in the topic)
Violence/Terrorism   killed, Baghdad, wound, car, attack, bomb, people, suicide, police, security
Refugee Crisis       refugee, child, million, Jordan, UNHCR, Syrian refugee, humanitarian, people, flee, aid
Economical Issues    oil, barrel, export, crude, company, market, energy, price, sanction, say
Political Issues     talk, Istanbul, round, Baghdad, Iran, meeting, Jalili, negotiation, p5, Tehran
Fig. 6. Topic modeling coherence score
We observed that NMF topic modeling shows improvements in the topic coherence scores (Fig 6). This result is in agreement with the results reported in [6]. Another significant advantage of NMF, compared to the probabilistic approaches, is the speed of finding topics: matrix factorization tends to be faster than its probabilistic counterparts.
6 Analysis and Conclusions
In this paper, we proposed a novel framework (PD-SEF) for processing and analyzing news articles to extract the signals of forced migration and use them to build prediction models for forecasting future displacements. Experiments demonstrated that violence scores are effective features: feeding them to prediction models improves the performance of the models.
The performance of PD-SEF depends on the quality of the news articles, in particular on complete and accurate coverage of the events. Unfortunately, the EOS dataset is missing many articles from the first six months of 2015, which has a negative impact on the quality of our model. This is observable in the sudden drop in the violence scores extracted by PD-SEF during 2015. We believe PD-SEF will show more accurate results on a dataset with better coverage. Also, many other factors affect refugee migration and might not be extractable from news articles using PD-SEF, such as the European Union's policy of accepting more refugees from the Middle East during 2015, which resulted in a significant increase in the number of refugees during that year.
Finally, the nature of forced population movement is multivariate, and it is therefore challenging to establish a coherent ground truth for validating our topics and predictions. We did not expect a high degree of correlation between our topics and the refugee movements. This is attributed to the lack of transparent data on the number, causes, and origins of population movements. We also recognize that the monthly analysis of news articles provides a limited degree of spatio-temporal awareness, which cannot be directly correlated with any related dataset.
Acknowledgments
We would like to extend special thanks to Dr. Jimmy Huang, Aijun An, and Susan McGrath for their support and valuable advice. This work was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada and an NSERC CREATE award. We thank the York Research Chair (YRC) Program and the anonymous reviewers for their insightful comments.
References
1. A. Agrawal, R. Sahdev, H. Davoudi, F. Khonsari, A. An, and S. McGrath. Detecting the
magnitude of events from news articles. In Web Intelligence (WI), 2016 IEEE/WIC/ACM
International Conference on, pages 177–184. IEEE, 2016.
2. L. Bottou. Large-scale machine learning with stochastic gradient descent. In Y. Lechevallier
and G. Saporta, editors, Proceedings of the 19th International Conference on Computational
Statistics (COMPSTAT’2010), pages 177–187, Paris, France, August 2010. Springer.
3. T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. CoRR, abs/1603.02754,
2016.
4. K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based
vector machines. J. Mach. Learn. Res., 2:265–292, Mar. 2002.
5. N. Gillis. The why and how of nonnegative matrix factorization. 12, 01 2014.
6. D. Greene and J. P. Cross. Exploring the Political Agenda of the European Parliament Using
a Dynamic Topic Modeling Approach. Political Analysis, 2016.
7. T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy
of Sciences, 101(Suppl. 1):5228–5235, April 2004.
8. D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization.
Nature, 401(6755):788–791, oct 1999.
9. D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum. Optimizing semantic
coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, EMNLP ’11, pages 262–272, Stroudsburg, PA, USA, 2011.
10. D. O'Callaghan, D. Greene, J. Carthy, and P. Cunningham. An Analysis of the Coherence of
Descriptors in Topic Modeling. Expert Systems with Applications (ESWA), 2015.
11. M. Röder, A. Both, and A. Hinneburg. Exploring the Space of Topic Coherence Measures. In
Proceedings of the Eighth ACM International Conference on Web Search and Data Mining - WSDM ’15, pages 399–408, New York, New York, USA, 2015. ACM Press.
12. S. Schmeidl. Exploring the causes of forced migration: A pooled time-series analysis, 1971-1990. Social Science Quarterly, pages 284–308, 1997.
13. R. Sulo, T. Berger-Wolf, and R. Grossman. Meaningful selection of temporal resolution for
dynamic networks. In Proceedings of the Eighth Workshop on Mining and Learning with
Graphs - MLG ’10, pages 127–136, New York, New York, USA, 2010. ACM Press.
14. X. Wang and A. McCallum. Topics over time: A non-markov continuous-time model of topical
trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD ’06, pages 424–433, New York, NY, USA, 2006. ACM.
Authors
Sadra Abrishamkar Ph.D. candidate in computer science at York University
Forouq Khonsari Received a Master's degree in computer science from York University