Technical Report: The Seventh On Artificial Intelligence and Machine Learning For Estimating Poverty
September 2018
Executive Summary
The Government of Indonesia has made significant progress in reducing poverty over the past few years, recording
its lowest poverty rate of ten per cent in 2017 measured by income. Many citizens still remain vulnerable given their
marginal position above the national poverty line. But how governments go about estimating poverty, in order to
better target programmes, has never been an easy task. Today, technological advancements are enabling researchers
to use new and efficient methods to learn more about people’s quality of life. In particular, with more and more big
data sources emerging, researchers are seeing the benefits of big data analytics for reducing poverty and improving
citizens’ well-being.
From 15-18 July 2018, Pulse Lab Jakarta research dive brought together academics, public officials and researchers
to dive into a few big data sets to develop new methods and insights on burning policy questions around poverty
reduction. An underlying goal of this Research Dive was to support the Indonesian Government’s development
agenda using artificial intelligence and machine learning, specifically efforts geared towards achieving Sustainable
Development Goal number one on zero poverty. There were four teams and each was assigned a different dataset
with a unique research focus: (1) Measuring Vulnerability to Poverty Using Satellite Imagery, (2) Estimating
City-level Poverty Rates Based on E-commerce Data, (3) Using Twitter Data to Estimate District-Level Poverty in
Greater Jakarta, and (4) Exploring the Connection Between Social Media Activities and Poverty.
This report outlines the research findings from the research sprint and is structured as follows:
1. The first paper describes the data sets that were assigned to the participants.
2. The second paper explores satellite data as a means to measure vulnerability to poverty. The team analysed
nighttime light imagery from satellites over Yogyakarta.
3. The third paper looks at estimating city-level poverty rates on Java island by examining 2016 e-commerce
data from 118 cities. The group also tested the accuracy of using e-commerce data to estimate poverty by
comparing the results with official government data on poverty levels.
4. The fourth paper discusses how poverty may be estimated at the district level using social media content
and user profiles. The team used natural language processing to conduct content analysis of extracted
public tweets that contained food and poverty sensitive keywords.
5. The last paper explores the relationship between social media activities and poverty (based on survey and
census data at the village and individual level for the Greater Jakarta area).
Pulse Lab Jakarta is grateful for the cooperation of Ministry of National Development Planning, SMERU,
Humanitarian Data Exchange, The National Team for the Acceleration of Poverty Reduction (TNP2K), Directorate
of Central Data and Information Ministry of Social Affairs, OLX Indonesia, The National Institute of Aeronautics
and Space (LAPAN), Universitas Padjadjaran, Universitas Gadjah Mada, Universitas Muhammadiyah Gorontalo,
Universitas Udayana, World Food Programme, Institut Teknologi Sepuluh Nopember, National Statistics Agency
(BPS), Telkom University, Bina Nusantara University, STMIK Akakom Yogyakarta, Pertamina University, and
Sam Ratulangi University. Pulse Lab Jakarta is grateful for the support from Knowledge Sector Initiative (KSI), the
Artificial Intelligence Journal and the Department of Foreign Affairs and Trade (DFAT) Australia.
Advisor Note
Faizal Thamrin
Remote Sensing Advisor
Advisors
Prof. Arief Anshory Yusuf Universitas Padjadjaran
Faizal Thamrin DM Innovation
Prof. Dedi Rosadi Universitas Gadjah Mada
Researchers
Group 1 – Estimating Poverty at the Provincial Level with Satellite Data
Benny Istanto World Food Programme
I Wayan Gede Astawa Karang Universitas Udayana
Nursida Arif Universitas Muhammadiyah Gorontalo
Pamungkas Jutta Prahara Pulse Lab Jakarta
Group 3 - Estimating Poverty at the District Level with Social Media Data
Lili Ayu Wulandhari Bina Nusantara University
Sri Redjeki STMIK AKAKOM Yogyakarta
Widaryatmo Central Statistics Agency
Yunita Sari Universitas Gadjah Mada
Muhammad Rizal Khaefi Pulse Lab Jakarta
Group 4 - Estimating Poverty at the Household Level with Social Media Data and Household
Survey Results
Eka Puspitawati Pertamina University
Eko Fadilah TNP2K
Hizkia H. D. Tasik Sam Ratulangi University
Nurlatifah Central Statistics Agency
Rajius Idzalika Pulse Lab Jakarta
Table of Contents
Data Description for AI and Machine Learning for Estimating Poverty
Estimating Poverty at the District Level with Social Media Data
Estimating Poverty at the Household Level with Social Media Data and Household Survey Results
Research Dive Artificial Intelligence and Machine Learning for
Estimating Poverty
the country’s economic crises in 1997 and 1998. For the central government, poverty reduction has been the main focus of each
Poor and Vulnerable and Reducing Inequality: Improving Programme Targeting, Design, and Process
1 https://www.brookings.edu/blog/future-development/2017/11/07/global-poverty-is-declining-but-not-fast-enough/
4 http://www.worldbank.org/en/country/indonesia/overview
5 http://www.jblumenstock.com/files/papers/jblumenstock_2016_science.pdf
6 http://www.jblumenstock.com/files/papers/jblumenstock_2016_science.pdf
were not provided by Pulse Lab Jakarta to support and elaborate on their proposed solutions. The satellite imagery data was given to team one to estimate poverty at the provincial level. The e-commerce data was shared with the second team to estimate poverty at the city level. The third group was assigned the social media data. In this Research Dive, Pulse Lab Jakarta shared Twitter data from 2014 to estimate poverty at the district level. Team four estimated poverty at the household level and was given access to Twitter data from 2014 and household survey results.
2 DATASETS
In this section, we briefly explain the types of data used by the participants during the Research Dive.
2.1 Satellite Imagery Data
2.1.1 Indonesian National Institute of Aeronautics and Space (LAPAN). In partnership with LAPAN, PLJ provided nighttime satellite data from 2015. The data comes from the Visible Infrared Imaging Radiometer Suite (VIIRS) on the Suomi National Polar-orbiting Partnership (NPP) satellite and is in .tif format.
2.1.2 Imagery of Asia - Australia. The imagery data is from 2012 and 2016 and can be accessed through the NASA website. It comes from the Visible Infrared Imaging Radiometer Suite (VIIRS) on the Suomi NPP satellite.
2.1.3 World Imagery. PLJ provided world imagery data for 2010-2013 from the Operational Linescan System (OLS) imaging systems on Defense Meteorological Satellite Program (DMSP) spacecraft, version 4.
Table 3: Example of Household Survey Data
Estimating Poverty at the Provincial Level with Satellite Data
Nursida Arif Pamungkas Jutta Prahara
Universitas Muhammadiyah Gorontalo Pulse Lab Jakarta
Gorontalo, Indonesia Jakarta, Indonesia
1 INTRODUCTION
Satellite imagery is increasingly available for free at certain resolutions at global scale and contains a lot of pixel-level information that could be associated with economic activity. Many studies suggest that satellite data can be used as a proxy for a number of variables, including urbanization, density, and economic growth. The environment can be a parameter of poverty, but the complex nature of the population makes it difficult to measure its influence. Spatially, the effect of the environment on poverty can be examined with a remote sensing approach. Although there is no evidence that satellite-based prediction estimates poverty better than conventional censuses, this approach can be used as support.
One of the parameters that can be extracted from satellite images is land use. Some previous researchers used the land-use approach as an indicator of poverty [3] [5]. An increase in population will trigger land clearing for various interests such as settlement, farmland or industry. [6] describes one perspective of land use as a functional space intended to accommodate diverse uses. In this perspective the land accommodates the growth of the area driven by population growth and economic expansion. Rural areas have different characteristics from urban areas. According to Law No. 26 of 2007 and Minister of Public Works Regulation No. 41 of 2007, rural areas are areas whose main activities are agricultural, including natural resource management, with regional functions arranged as a place for rural settlements, government services, social services and economic activities.
Urban areas, in contrast, are dominated by non-agricultural activities. [4] defines in more detail that the pattern of cities can be seen from the existence of built areas such as settlements and infrastructure, while the pattern of villages is dominated by agricultural land and forest land, with settlement patterns that are small and not centralized. Rural land is mostly used for mining and agrarian activities, such
Figure 1: Land Use Class Groups
Yogyakarta Province is chosen for the case study in this research. The rationale is that the historical population living below the poverty line has diverse conditions (low to high) and regions (urban, rural, coastal).
3 DISCUSSION
Geographically, Yogyakarta is a region with various landforms. The northern region is volcanic land, the eastern and western regions are plateaus, and the middle and the south are lowlands. Different physiographic conditions have an impact on population distribution and economic progress, so that vulnerability to poverty is higher in areas with hilly and mountainous topography. This is evidenced by the land use shown in Figure 2.
Figure 2 shows that the residential area of Sleman Regency is predominantly distributed in the southern Sleman region bordering Yogyakarta and in Central Sleman. In Kulonprogo District, settlements are dominant in the south and east. Bantul District, the dominant
Figure 2: Yogyakarta Land Use Map
Figure 3: Land Use Maps with Built Up Area
4 CONCLUSIONS
The land use analysis indicates that the poverty-prone areas in the Special Region of Yogyakarta are Kulon Progo and Gunung Kidul, since their built areas are smaller than those of other regions. In Kulon Progo, the regions with high poverty and vulnerability are generally areas with a small population and physically difficult land, including disaster-prone areas such as Kecamatan Kokap.
ABSTRACT
Indonesia abundantly produces big data from various sources, e.g. social media, financial transactions, transportation, call detail records and e-commerce. These types of data have been considered as potential resources to complement periodic surveys, and even the census, for monitoring development indicators, including the poverty rate. This research aims to estimate the poverty rate at city level based on e-commerce data using machine learning methods, i.e. Support Vector Regression (SVR) and Artificial Neural Network (ANN). Feature selection has been performed with Fast Correlation-Based Filter (FCBF). The result shows that the ANN-based model predicts the city-level poverty rate very well, with high accuracy, low error and low bias. This research suggests that e-commerce data has the potential to be used as a proxy for the city-level poverty rate.
KEYWORDS
TBC
1 INTRODUCTION
The poverty rate is the ratio of the number of people whose income falls below the poverty line, taken as half the median household income of the total population [1]. In Indonesia, the poverty rate is produced yearly by Statistics Indonesia through the National Social and Economic Survey [4]. Despite all the benefits of conducting this survey every year, there are a couple of limitations: (i) the inability to gather information on the poverty rate in between surveys and (ii) the resources required to conduct the surveys.
Along with that, Indonesia abundantly produces big data from various sources, e.g. social media, financial transactions, transportation, call detail records, e-commerce, etc. These types of data have been considered as potential resources to complement periodic surveys, and even the census, for monitoring development indicators, including the poverty rate. As a dimension of economics, the poverty rate could be highly correlated with data related to consumption and purchasing power, which can potentially be represented by e-commerce data. According to [3], revenue in the e-commerce market in Indonesia amounts to USD 9,138m in 2018, with user penetration at 40% in 2018 and expected to hit 48.3% in 2022.
This research aims to estimate the poverty rate at city level based on e-commerce data using machine learning methods, i.e. Support Vector Regression (SVR) and Artificial Neural Network (ANN). Considering the representation of e-commerce in certain areas in Indonesia, the scope of this research is 118 cities on Java island.
2 DATASET
This study utilizes two main data sources that complement each other. The main dataset used in this research is the advertisements of goods posted on one of the big e-commerce platforms in Indonesia, OLX. The following goods are included in the analysis.
(1) Car
(2) Motorbike
(3) House to sell
(4) House to rent
(5) Apartment to sell
(6) Apartment to rent
(7) Land to sell
(8) Land to rent
For each of those goods, the number of items sold, price sold, number of viewers, and number of buyers were extracted. Each of these quantities was then aggregated by city, per type of goods, to calculate summary statistics, i.e. sum, average, and standard deviation, to capture both the central tendency and the variation of the data. In total, there are 96 initial features extracted for this research.
The ground truth is the poverty rate at city level published by Statistics Indonesia (see Figure 1). For both the e-commerce and the official data, the data from 2016 has been used for this research.
Table 1: Train-and-test splitting procedure
                    Split 1   Split 2
Odd observation     Train     Test
Even observation    Test      Train
3 METHODOLOGY
3.1 Pre-processing
Given 96 features and city i = 1, ..., 118, normalisation has been done for each feature with the following formula:
Figure 1: Poverty rate (%) at city-level, 2016 [5]
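The feature construction described in the Dataset section (8 types of goods, 4 quantities per advertisement, 3 statistics per city, hence 8 x 4 x 3 = 96 features) can be sketched as follows. This is a minimal illustration and not the authors' code: the record layout, the field names (`items_sold`, `price_sold`, `viewers`, `buyers`), the sample values, and the choice of the sample standard deviation are all assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical identifiers for the eight types of goods listed in the report.
GOODS = ["car", "motorbike", "house_sell", "house_rent",
         "apartment_sell", "apartment_rent", "land_sell", "land_rent"]
# Hypothetical field names for the four quantities extracted per advertisement.
FIELDS = ["items_sold", "price_sold", "viewers", "buyers"]
STATS = ["sum", "avg", "std"]


def feature_names():
    """Return the 8 x 4 x 3 = 96 feature names."""
    return [f"{g}_{f}_{s}" for g in GOODS for f in FIELDS for s in STATS]


def aggregate_by_city(ads):
    """Aggregate a list of advertisement records into one feature row per city."""
    grouped = defaultdict(list)
    cities = set()
    for ad in ads:
        cities.add(ad["city"])
        for field in FIELDS:
            grouped[(ad["city"], ad["goods"], field)].append(ad[field])

    table = {}
    for city in sorted(cities):
        row = {}
        for goods in GOODS:
            for field in FIELDS:
                values = grouped.get((city, goods, field), [])
                # Missing combinations default to 0.0 (an assumption).
                row[f"{goods}_{field}_sum"] = float(sum(values))
                row[f"{goods}_{field}_avg"] = mean(values) if values else 0.0
                # The sample standard deviation needs at least two values.
                row[f"{goods}_{field}_std"] = stdev(values) if len(values) > 1 else 0.0
        table[city] = row
    return table


# Invented sample records, for illustration only.
ads = [
    {"city": "A", "goods": "car", "items_sold": 2, "price_sold": 100.0, "viewers": 10, "buyers": 1},
    {"city": "A", "goods": "car", "items_sold": 4, "price_sold": 300.0, "viewers": 30, "buyers": 3},
    {"city": "B", "goods": "motorbike", "items_sold": 1, "price_sold": 50.0, "viewers": 5, "buyers": 1},
]
features = aggregate_by_city(ads)
```

Each city ends up with a fixed-length row of 96 values, which is the shape a regression model expects regardless of how many advertisements the city has.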
$$\min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i + C \sum_{i=1}^{l} \xi_i^* \tag{7}$$
subject to
$$w^T \phi(X_i) + b - Z_i \le \epsilon + \xi_i, \qquad Z_i - w^T \phi(X_i) - b \le \epsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0, \quad i = 1, \ldots, l$$
where $w$, $C$, $\xi$, $\epsilon$, $b$ denote the slope matrix, the regularization parameter, the slack variable for the soft margin, the margin of tolerance, and the intercept/bias, respectively. The symbol $\phi(X_i)$ indicates the mapping of $X_i$ into a higher dimensional space. The dual optimization problem is given by
$$\min_{\alpha,\,\alpha^*} \; \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + \epsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} Z_i (\alpha_i - \alpha_i^*) \tag{8}$$
subject to
$$e^T (\alpha - \alpha^*) = 0, \qquad 0 \le \alpha_i, \alpha_i^* \le C, \quad i = 1, \ldots, l$$
where $\alpha$ and $\alpha^*$ denote the Lagrangian multipliers, $Q_{i,j} = K(X_i, X_j) \equiv \phi(X_i)^T \phi(X_j)$ and $e = [1, \ldots, 1]^T$. In linear SVR, the decision function is expressed by
$$Y = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) \langle X_i, X \rangle + b \tag{9}$$
In non-linear SVR, the kernel function, e.g. RBF, transforms the input data into a higher dimensional feature space in which the linear separation is performed. The decision function is computed by
$$Y = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) \langle \phi(X_i), \phi(X) \rangle + b \tag{10}$$
$$Y = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) K(X_i, X) + b \tag{11}$$
The RBF kernel is used to deal with non-linear data and can be computed by the following equation.
$$K(X_i, X) = \exp\left( -\gamma \lVert X_i - X \rVert^2 \right) \tag{12}$$
In this experiment, grid search is performed to determine the two hyperparameters, one searched over [0.01, 0.1, 1, 10, 100, 1000] and the other over [0.01, 0.1, 1, 10, 100].
Figure 2: Architecture of ANN
3.5 Performance Measurement
The performance of the models is measured by three metrics, i.e. root mean squared error (RMSE), bias factor, and accuracy factor. The equations and descriptions of those measurements are detailed in Table 3.
4 RESULT AND CONCLUSION
The following three models have been built in this research.
(1) SVR-based model with 96 features
(2) SVR-based model with 29 features selected through FCBF procedure
(3) ANN-based model with 29 features selected through FCBF procedure
As discussed in the methodology section, the models' performance is assessed by three metrics, i.e. RMSE, accuracy factor, and bias factor. The performance of the three models is shown in Table 4. The RMSE of the SVR-based model with FCBF (4.3363) is only slightly lower than that of the SVR-based model without feature selection (4.9037), showing that the SVR-based model produces a similar error in estimating the city-level poverty rate with or without feature selection.
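Equations (11) and (12) can be illustrated with a short sketch. The code below only evaluates the RBF kernel and the kernel form of the SVR decision function for given coefficients; it does not solve the dual problem (8). The support vectors, the coefficient values, `gamma` and `b` are made-up numbers chosen for illustration, not values from this study.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel of Eq. (12): exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def svr_predict(x, support_vectors, coefs, b, gamma):
    """Kernel form of the decision function, Eq. (11).

    coefs[i] stands for the combined dual coefficient of support vector i,
    i.e. the difference of its two Lagrangian multipliers.
    """
    weighted = sum(c * rbf_kernel(sv, x, gamma)
                   for sv, c in zip(support_vectors, coefs))
    return weighted + b


# Illustrative values only.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
coefs = [1.5, -0.5]
prediction = svr_predict([0.5, 0.5], support_vectors, coefs, b=10.0, gamma=0.1)
```

A larger `gamma` makes the kernel decay faster with distance, so each support vector influences only nearby points; this is why the kernel width is usually tuned by grid search together with the regularization strength.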
Table 2: Performance measurements
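The three metrics named in the Performance Measurement section can be sketched as follows. The table with the exact equations is not reproduced in this text, so the bias factor and accuracy factor below are an assumption: they use the common log-ratio form from predictive microbiology, the field of reference [2], and the definitions actually used in the study may differ in detail. The sample numbers are invented.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def bias_factor(actual, predicted):
    """Assumed form: 10 ** mean(log10(predicted / actual)).

    A value below 1 means the model under-predicts on average.
    """
    logs = [math.log10(p / a) for a, p in zip(actual, predicted)]
    return 10 ** (sum(logs) / len(logs))


def accuracy_factor(actual, predicted):
    """Assumed form: 10 ** mean(|log10(predicted / actual)|).

    A value of 1 means every prediction equals the actual value.
    """
    logs = [abs(math.log10(p / a)) for a, p in zip(actual, predicted)]
    return 10 ** (sum(logs) / len(logs))


# Invented poverty rates (%), for illustration only.
actual = [10.0, 20.0]
predicted = [5.0, 10.0]
scores = (rmse(actual, predicted),
          bias_factor(actual, predicted),
          accuracy_factor(actual, predicted))
```

Under these definitions the bias factor shows the direction of the average error, while the accuracy factor, which uses absolute log ratios, shows its size regardless of direction.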
Meanwhile, the ANN-based model with FCBF produces a much smaller RMSE, i.e. 0.2725, than the two preceding SVR-based models. This number indicates that estimating the city-level poverty rate with ANN-FCBF gives very good accuracy, since the RMSE is close to zero. Although all three models predict a poverty rate lower than the actual one (underestimate), as indicated by a bias factor of less than 1, the bias factor for the ANN-FCBF model is only slightly less than 1 (0.9981). Moreover, the ANN-FCBF model gives almost accurate predictions, since the value of the accuracy factor is very close to 1 (1.0007).
Figure 2 and Figure 3 show the predicted poverty rate based on SVR-FCBF and ANN-FCBF, respectively, compared with the actual poverty rate. The actual poverty rate is sorted for better visualization. The prediction of the poverty rate produced by ANN-FCBF follows the actual poverty rate, with around one-third of the city-level poverty rates predicted exactly the same as the actual ones. For cities with a high actual poverty rate, the poverty rate is relatively difficult to predict with the SVR-FCBF model.
time series or panel data, and data from different e-commerce platforms.
REFERENCES
[1] Organisation for Economic Co-operation and Development. 2018. Poverty Rate. (2018). https://data.oecd.org/inequality/poverty-rate.htm
[2] C. P. T. R. Baranyi J. 1999. Validating and comparing predictive model. Journal of Food Microbiology 3 (1999).
[3] Statista. [n. d.]. ([n. d.]). https://www.statista.com/outlook/243/120/ecommerce/indonesia
[4] Badan Pusat Statistik. [n. d.]. ([n. d.]). https://microdata.bps.go.id/mikrodata/index.php/catalog/SUSENAS/about
[5] Badan Pusat Statistik. [n. d.]. Persentase Penduduk Miskin Menurut Kabupaten/Kota, 2015 - 2017. ([n. d.]). https://www.bps.go.id/dynamictable/2017/08/03/1261/persentase-penduduk-miskin-menurut-kabupaten-kota-2015---2017.html
[6] V.N. Vapnik. 1998. Statistical Learning Theory.
[7] H. L. L. Yu. 2003. Feature Selection for High-Dimensional Data: Fast Correlation-Based Filter Solution. Twentieth International Conference on Machine Learning (ICML) (2003).
Estimating Poverty at the District Level with Social Media
Extended Abstract
Lili Ayu Wulandhari Sri Redjeki Yunita Sari
Bina Nusantara University STMIK AKAKOM Yogyakarta Universitas Gadjah Mada
Jakarta, Indonesia Yogyakarta, Indonesia Yogyakarta, Indonesia
which serves as a potential resource for many applications. Sakaki et al. [14] investigated the real-time interaction of events, such as earthquakes, in Twitter and proposed an algorithm to monitor tweets and to detect a target event. They used Support Vector Machine (SVM) and three feature groups for event detection and applied Kalman and particle filters for location estimation. By considering each Twitter user as a sensor, Sakaki et al. constructed a reporting system which detects earthquakes promptly and sends notification e-mails to registered users. They reported that their system is able to deliver notifications faster than the announcements broadcast by the Japan Meteorological Agency (JMA).
Another work by Aramaki et al. [3] addressed the issue of detecting influenza epidemics using Twitter. They constructed an influenza corpus consisting of 0.4 million tweets, which are divided into training and testing parts. Using two pre-defined conditions, human annotation was conducted to assign positive and negative labels to the tweets. Several machine learning classifiers with Bag Of Words (BOW) features were used to identify whether a given tweet is positive or negative. Their experiment results outperformed the state-of-the-art Google method by obtaining a high correlation (correlation ratio=0.89).
Previous work has also tried to predict demographic attributes from Twitter, which are useful for marketing, personalization, and legal investigation. A work by Burger et al. [6] constructed a large, multilingual dataset labeled with gender and explored several statistical approaches for identifying the gender of Twitter users. In order to assign gender labels to the Twitter accounts, the authors sampled the corresponding user profiles, which were obtained by following the Twitter URL links to several of the most represented blog sites in their dataset. Burger et al. used the content of the tweets and three user profile fields, namely full name, screen name, and description, to discriminate the gender of the users. Their approach obtained a best accuracy of 92%. In addition, by using only the content of the tweets, the model gained 76% accuracy.
Similar to Burger et al., Flekova et al. [8] explored stylistic variation with age and income on Twitter. By using a variety of features such as word and character lengths, readability measures (i.e. the Automatic Readability Index, the Flesch Kincaid Grade Level, the Flesch Reading Ease), part-of-speech (POS), and a contextuality measure combined with linear regression, they successfully predicted the age and income groups of the Twitter users. One interesting finding is that Flesch Reading Ease, previously reported to correlate with education levels at a community level, is highly indicative of income. In addition, the increased use of nouns, determiners and adjectives correlates more strongly with age than with income.
3 METHODOLOGY
This research is conducted in four steps, namely data preparation, data preprocessing and annotation, pseudo labeling and evaluation. Data preparation aims to comprehend and analyze the information contained in the data. This step justifies which techniques are chosen for preprocessing. The preprocessing step is conducted to extract keywords from the raw data as one of the leading indicators. The result of the preprocessing step is used in the pseudo labeling algorithm. A detailed explanation of the data and of pseudo labeling is presented in subsection 3.1 and subsection 3.3.
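The pseudo labeling step named in the methodology can be sketched as a generic self-training loop: train on a few manually labeled seeds, label the unlabeled tweets the model is confident about, add them to the training set, and repeat. The report does not spell out the classifier, the features, or the confidence rule, so everything below is an assumption made for illustration: the small Naive Bayes scorer, the `threshold` value, and the English toy sentences.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labeled):
    """Count tokens per class from (text, label) pairs."""
    token_counts = {}
    class_counts = Counter()
    for text, label in labeled:
        class_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokenize(text))
    return token_counts, class_counts


def predict_proba(model, text):
    """Multinomial Naive Bayes with add-one smoothing, normalised to sum to 1."""
    token_counts, class_counts = model
    vocab = set()
    for counts in token_counts.values():
        vocab.update(counts)
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, counts in token_counts.items():
        total = sum(counts.values())
        score = math.log(class_counts[label] / total_docs)
        for token in tokenize(text):
            # The extra +1 in the denominator reserves mass for unseen tokens.
            score += math.log((counts[token] + 1) / (total + len(vocab) + 1))
        log_scores[label] = score
    top = max(log_scores.values())
    exps = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exps.values())
    return {label: value / norm for label, value in exps.items()}


def pseudo_label(seeds, unlabeled, threshold=0.7, max_rounds=5):
    """Self-training: move confidently predicted items into the labeled set."""
    labeled = list(seeds)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, remaining = [], []
        for text in pool:
            proba = predict_proba(model, text)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                confident.append((text, label))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new was learned in this round
        labeled.extend(confident)
        pool = remaining
    return train(labeled), labeled


# Toy English examples, invented for illustration.
seeds = [("poverty is rising no jobs", "poor"),
         ("rice cake with pandan", "non-poor")]
unlabeled = ["poverty and no jobs again", "sweet rice cake"]
model, labeled = pseudo_label(seeds, unlabeled)
```

The threshold controls the trade-off in this scheme: a low value grows the training set quickly but lets early mistakes reinforce themselves, while a high value keeps the pseudo labels cleaner at the cost of leaving more tweets unlabeled.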
Figure 3: Twitter users behavior in a day within JABODETABEK area
KAB_NAME:
• Jakarta Selatan: 1
• Jakarta Timur: 2
• Jakarta Pusat: 3
• Jakarta Barat: 4
• Jakarta Utara: 5
• Kepulauan Seribu: 6
• Bogor: 7
• Depok: 8
• Bekasi: 9
• Tangerang: 10
• Tangerang Selatan: 11
GENDER:
• Male: 1
• Female: 2
This data is annotated manually into poor and non-poor classes to be the input data in the pseudo labeling approach.
Table 1: Example of Twitter Data
class       tweets
poor        Jumlah uang yg beredar sama, namun penduduk meningkat, anak2 itu tidak dapat lapangan kerja, kemiskinan meningkat
poor        Dan anyway, karena pengontrolan yg kurang itu, negara dipuyengkan oleh beban generasi anak yg tidak terkontrol, efeknya,kemiskinan meningkat
poor        Awal bulan padahal, udah miskin ajayaa
poor        Negara miskin karena penduduknya sendiri. Yg miskin ngga berusaha maju yg kaya berlagak miskin punya mobil pribadi masih pake bbm subsidi
poor        Orang miskin jgn sakit biaya rumah sakit mahal. Orang miskin kerja lebih keras lg bayar uang kuliah anakmu dua kali lipat penghasilanmu.
non-poor    Ngedengerin ecen ngomong suaranya berat bgt kqya bawa beras sekarung -_-
non-poor    Kaya materi tapi miskin hati. Kasian
non-poor    makanan buat brino... bubur kacang merah + beras merah + daging ayam cincang.. udh kyk baby ni ci binyo
non-poor    Cendol terbuat dari tepung beras trus di kasih pandan gituu...??? Pantesss konyang wak dek nyoo... :)
non-poor    Minum beras kencur biar sixpack
Table 2: List of commodities contribute to poverty (%) (source: Socio Economic National Survey, March 2018)
Figure 4: Pseudo labeling
based on Socio Economic National Survey (SUSENAS) in 2018 (see details in Section 3.1).
4 RESULTS AND ANALYSIS
Table 4 presents the results of our experiment. Using only 10 labeled data as initial seeds, our model obtained 65.73% accuracy, with precision and recall of 64% and 66% respectively. Our results demonstrate that the chosen features are effective for this task. Moreover, the results provide evidence that the pre-defined keywords are suitable for identifying tweets with poverty-related information. The model’s performance can be improved by adding more pseudo labeled data. However, in our experiment we did not remove fake tweets which are generated automatically using bot software. Thus, our results may not describe the real poverty condition in a particular area.
Table 4: Experiment results
precision   recall   F1-score   accuracy
64%         66%      65%        65.73%
ACKNOWLEDGMENT
We would like to acknowledge Pulse Lab Jakarta for organising the Research Dive event and providing the data. We also wish to thank Prof. Arief Anshory Yusuf and Prof. Dedi Rosadi for their insightful feedback.
REFERENCES
[1] APJI APJII. 2016. Penetrasi dan Perilaku Pengguna Internet Indonesia. Infografis Hasil Survey (2016), 1–35.
[2] APJI APJII. 2017. Penetrasi dan Perilaku Pengguna Internet Indonesia. Infografis Hasil Survey (2017), 1–39.
[3] Eiji Aramaki, Sachiko Maskawa, and Mizuki Morita. 2011. Twitter Catches the Flu: Detecting Influenza Epidemics Using Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’11). Association for Computational Linguistics, Stroudsburg, PA, USA, 1568–1576. http://dl.acm.org/citation.cfm?id=2145432.2145600
[4] Pablo D Azar and Andrew W Lo. 2016. Practical Applications of The Wisdom of Twitter Crowds: Predicting Stock Market Reactions to FOMC Meetings via Twitter Feeds. Practical Applications 4, 2 (2016), 1–4.
[5] Johan Bollen, Huina Mao, and Xiaojun Zeng. 2011. Twitter mood predicts the stock market. Journal of computational science 2, 1 (2011), 1–8.
[6] John D. Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discriminating Gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’11). Association for Computational Linguistics, Stroudsburg, PA, USA, 1301–1309. http://dl.acm.org/citation.cfm?id=2145432.2145568
[7] Nugroho Dwi Prasetyo and Claudia Hauff. 2015. Twitter-based election prediction in the developing world. In Proceedings of the 26th ACM Conference on Hypertext & Social Media. ACM, 149–158.
[8] Lucie Flekova, Daniel Preoţiuc-Pietro, and Lyle Ungar. 2016. Exploring Stylistic Variation with Age and Income on Twitter. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, 313–319. https://doi.org/10.18653/v1/P16-2051
[9] Matthew S Gerber. 2014. Predicting crime using Twitter and kernel density estimation. 115–125 pages.
[10] Dong-Hyun Lee. 2013. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. (07 2013).
[11] Nikola Ljubešić, Darja Fišer, and Tomaž Erjavec. 2017. Language-independent Gender Prediction on Twitter. In Proceedings of the Second Workshop on NLP and Computational Social Science. Association for Computational Linguistics, 1–6. http://aclweb.org/anthology/W17-2901
[12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[13] Shigeyuki Sakaki, Yasuhide Miura, Xiaojun Ma, Keigo Hattori, and Tomoko Ohkuma. 2014. Twitter User Gender Inference Using Combined Analysis of Text and Image Processing. In Proceedings of the Third Workshop on Vision and Language. Dublin City University and the Association for Computational Linguistics, 54–61. https://doi.org/10.3115/v1/W14-5408
[14] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World wide web. ACM, 851–860.
ABSTRACT
Social media data, collected automatically through the interaction of individuals, can provide insights on many emerging issues, ranging from social life to politics. This extended abstract explores how social media data correlate with poverty measurement at both regional and household levels, using community/village level poverty mapping and a household poverty measurement survey. Discovering ways to measure poverty through social media data offers a more rapid and inexpensive measure of poverty compared to completing poverty mapping or household surveys. We describe the statistical techniques that allow us to evaluate the potential of poverty estimation using social media data, particularly Twitter. We also discuss follow-ups that can contribute to better estimations.
KEYWORDS
Poverty, Twitter, Household Survey
Figure 1: Twitter user penetration from 2012-2018, by country
1 INTRODUCTION
Household/individual poverty in Indonesia is usually measured through household/individual expenditure level. This measurement requires a survey on a representative sample of the household population in Indonesia. The household survey is generally expensive and time-consuming, and the data generated from the survey is constrained to a certain time frame.
To overcome that restriction, we investigate the available connection between social media (Twitter) data and household survey data. Twitter users have significantly increased in numbers over the past six years (2012-2018), especially in Asia-Pacific.
Since increased access to internet services boosts economic growth and improves the well-being of the poor [2], we are interested to see how the data from Twitter users (with internet access) can explain poverty at both regional and household levels.
We are using Twitter data of the Jabodetabek region in 2014, available as part of the Research Dive 7 [3] initiative by UN Pulse Lab. For the poverty measurement data, we use the Poverty Map of Indonesia 2015, representing poverty (headcount ratio and Gini ratio) at community/village level, available from the SMERU research institute, and the Indonesian Family Life Survey (IFLS) [1], available from RAND Corporation. The poverty map data is used to investigate the correlation between social media data and regional poverty level measurement, while the IFLS data is used to investigate the correlation between social media and household poverty level measurement.
2 TWITTER METADATA PERFORMANCE
Expanding on available Twitter data, we find variation between districts in Jabodetabek regarding total Twitter user IDs, total messages posted, total mentions in messages and total hashtags in messages. Districts observed in the Jabodetabek region are:
3101 Kepulauan Seribu
3171 Kota Jakarta Selatan
3172 Kota Jakarta Timur
3173 Kota Jakarta Pusat
3174 Kota Jakarta Barat
3175 Kota Jakarta Utara
3276 Kota Depok
3603 Kabupaten Tangerang
[Bar charts by district code (x-axis: 3101, 3171, 3172, 3173, 3174, 3201, 3275, 3276, 3603, 3671, 3674): sum of total_user_id, sum of total_post, sum of hashtag, sum of mention]
Figure 4: Total mentions in Twitter messages in Jabodetabek region 2014, at district level
3 REGIONAL LEVEL CORRELATION WITH POVERTY
The SMERU Poverty Map of Indonesia 2015 calculates various poverty measures based on several surveys conducted by Statistics Indonesia. We use two poverty measures to explore against social media data: poverty headcount index and Gini coefficient.
Poverty headcount index (P0) is defined as:
$$P_0 = \frac{1}{N} \sum_{i=1}^{q} \left( \frac{z - y_i}{z} \right)^{0} \tag{1}$$
Where:
P0 : Headcount index
Gini coefficient (G1) is defined as:
$$G_1 = 1 - \frac{1}{N} \sum_{i=1}^{N} \left( Y_i + Y_{i-1} \right) \tag{2}$$
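The two poverty measures of Equations (1) and (2) can be computed with a few lines of code. This is a minimal sketch under stated assumptions: the symbol definitions after "Where:" are not fully preserved in this text, so the code assumes the usual reading in which N is the population size, q the number of people below the poverty line z, y_i the expenditure of person i, and Y_i the cumulative expenditure share of the poorest i people (with Y_0 = 0). The sample values are invented.

```python
def headcount_index(expenditures, poverty_line, alpha=0):
    """Eq. (1): sum over the poor of ((z - y_i) / z) ** alpha, divided by N.

    With alpha = 0 every poor person contributes 1, so the result is q / N.
    """
    n = len(expenditures)
    poor = [y for y in expenditures if y < poverty_line]
    gaps = [((poverty_line - y) / poverty_line) ** alpha for y in poor]
    return sum(gaps) / n


def gini(expenditures):
    """Eq. (2): 1 - (1 / N) * sum(Y_i + Y_{i-1}) over cumulative shares Y_i."""
    ordered = sorted(expenditures)
    n = len(ordered)
    total = sum(ordered)
    running = 0.0
    previous_share = 0.0  # Y_0 = 0
    accumulated = 0.0
    for y in ordered:
        running += y
        share = running / total  # Y_i, cumulative share up to person i
        accumulated += share + previous_share
        previous_share = share
    return 1 - accumulated / n


# Invented per-capita expenditures and poverty line, for illustration only.
sample = [1.0, 2.0, 3.0, 4.0]
p0 = headcount_index(sample, poverty_line=2.5)
g1 = gini(sample)
```

Keeping `alpha` as a parameter is a design choice: it shows that the headcount index is the alpha = 0 member of a family of gap-based measures, while the default reproduces the simple share of people below the line.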
http://rd.pulselabjakarta.id/