Technical Report: The Seventh Research Dive on Artificial Intelligence and Machine Learning for Estimating Poverty


Technical Report

The Seventh Research Dive on


Artificial Intelligence and Machine
Learning for Estimating Poverty

September 2018
Executive Summary

The Government of Indonesia has made significant progress in reducing poverty over the past few years, recording
its lowest poverty rate of ten per cent in 2017 measured by income. Many citizens still remain vulnerable given their
marginal position above the national poverty line. But how governments go about estimating poverty, in order to
better target programmes, has never been an easy task. Today, technological advancements are enabling researchers
to use new and efficient methods to learn more about people’s quality of life. In particular, with more and more big
data sources emerging, researchers are seeing the benefits of big data analytics for reducing poverty and improving
citizens’ well-being.

From 15-18 July 2018, Pulse Lab Jakarta's Research Dive brought together academics, public officials and researchers
to dive into a few big data sets to develop new methods and insights on burning policy questions around poverty
reduction. An underlying goal of this Research Dive was to support the Indonesian Government’s development
agenda using artificial intelligence and machine learning, specifically efforts geared towards achieving Sustainable
Development Goal number one on zero poverty. There were four teams and each was assigned a different dataset
with a unique research focus: (1) Measuring Vulnerability to Poverty Using Satellite Imagery, (2) Estimating
City-level Poverty Rates Based on E-commerce Data, (3) Using Twitter Data to Estimate District-Level Poverty in
Greater Jakarta, and (4) Exploring the Connection Between Social Media Activities and Poverty.

This report outlines the research findings from the research sprint and is structured as follows:

1. The first paper describes the data sets that were assigned to the participants.
2. The second paper explores satellite data as a means to measure vulnerability to poverty. The team analysed
nighttime light satellite imagery over Yogyakarta.
3. The third paper looks at estimating city-level poverty rates on Java island by examining 2016 e-commerce
data from 118 cities. The group also tested the accuracy of using e-commerce data to estimate poverty, by
comparing the results with official government data on poverty levels.
4. The fourth paper discusses how poverty may be estimated at the district level using social media content
and user profiles. The team used natural language processing to conduct content analysis of extracted
public tweets that contained food and poverty sensitive keywords.
5. The last paper explores the relationship between social media activities and poverty (based on survey and
census data at the village and individual level for the Greater Jakarta area).

Pulse Lab Jakarta is grateful for the cooperation of the Ministry of National Development Planning, SMERU,
Humanitarian Data Exchange, The National Team for the Acceleration of Poverty Reduction (TNP2K), Directorate
of Central Data and Information Ministry of Social Affairs, OLX Indonesia, The National Institute of Aeronautics
and Space (LAPAN), Universitas Padjadjaran, Universitas Gadjah Mada, Universitas Muhammadiyah Gorontalo,
Universitas Udayana, World Food Programme, Institut Teknologi Sepuluh Nopember, National Statistics Agency
(BPS), Telkom University, Bina Nusantara University, STMIK Akakom Yogyakarta, Pertamina University, and
Sam Ratulangi University. Pulse Lab Jakarta is grateful for the support from Knowledge Sector Initiative (KSI), the
Artificial Intelligence Journal and the Department of Foreign Affairs and Trade (DFAT) Australia.
Advisor Note

Estimating Poverty with New Alternative Methods


Any sudden, unfortunate event, such as a flood or a death in the family, could disrupt family finances. For example,
a change in weather may mean no income for construction workers in urban areas or damaged crops for rural farmers.
Official poverty-related figures, though, are only reported in March and September every year, so to understand
poverty throughout Indonesian districts we may need to wait even longer. Considering this, it is necessary for
us to find alternative estimates of poverty so that assistance and programmes to help poor people can be more timely
and effective.

One way of doing so is by utilizing new sources of data that have higher frequency, such as social media, online
transactions or other non-conventional data sources. These sources of data, often called ‘big’ data, can be analysed
and filtered into models based on artificial intelligence (AI), to produce sophisticated tools for more frequent and
better monitoring. Pulse Lab Jakarta took on the challenge through their Research Dive initiative, which assembles a
selected group of Indonesians from various disciplines to come up with new ideas of how to combine big data and AI to
improve poverty monitoring, as well as to produce relevant information to help Indonesia’s poverty alleviation
agenda. I was very proud to be part of this initiative.

Prof. Arief Anshory


Economics Advisor

He currently serves as a Professor of Economics in the Department of Economics at
the Faculty of Economics and Business, Universitas Padjadjaran and is the founder
of the Centre for Sustainable Development Goals Studies (SDGs Center). He also
works as the Director of Economy and Environment Institute (EEI) Indonesia. Most
of his research focuses on the economics of the environment and natural resource
management as well as economic development, particularly related to poverty and
inequality. Prof Arief received his PhD in Economics from Australian National
University. He holds a master’s degree in Environmental and Resource Economics
from the University College London and bachelor’s degree in Economics from
Universitas Padjadjaran.

Big Data for Poverty Analysis


Understanding poverty issues is essential for human development, and to do so effectively, more collaboration
is needed from various experts and citizens. Looking at how PLJ’s Research Dive is organised, there are several
opportunities to meet researchers from different fields of study and join forces to answer burning policy questions
through the use of big data. I hope the results can contribute to policy-making decisions in the country, and help
to improve government institutions that are responsible for monitoring and evaluating poverty reduction programmes.
Currently in Indonesia not many researchers conduct poverty analysis with big data, so this is a step in the right
direction for the country.

Prof. Dedi Rosadi


Statistics Advisor

Dedi Rosadi is a statistics professor from Universitas Gadjah Mada. He completed
his PhD at Vienna University of Technology, and before that his master’s degree
at the University of Twente. His research interests include time series, statistical
computing and mathematical finance. He has also written books related to statistics
and econometrics, for instance Econometrics and Time Series Analysing using R,
Time Series Analysis and Introduction to Statistical Modeling using R.
Advisor Note

An Effective Two-way Communication Between Advisors and Participants


This is my second time participating in PLJ’s Research Dive for Development. This time around was even more
amazing and eye-opening, partly because it was not directly connected with my areas of work. The deep-diving analysis
and discussions around “artificial intelligence and machine learning for estimating poverty” gave me a lot of insights
that can be transferred to my own work. For me, the Research Dive atmosphere and set up allowed an effective
two-way communication and learning experience between advisors and participants that is also very enriching. I hope
to see more research dives, more practitioner involvement as well as continuation and improvement of work from
research dive alumni. Thank you Pulse Lab Jakarta for allowing me to be a part of this interesting event.

Faizal Thamrin
Remote Sensing Advisor

Faizal Thamrin works for DMInnovation as a Disaster Management Specialist. He has
also been a Data Manager at Humanitarian Data Exchange, focused on strengthening
data collaboration with humanitarian partners, governments and academia. Previously,
he spent around 10 years working with United Nations Office for the Coordination of
Humanitarian Affairs (UNOCHA) in Bangladesh, Indonesia, Philippines, and Pakistan
as a Geographic Information Officer and Information Management Officer.
Research Dive

Advisors
Prof. Arief Anshory Yusuf Universitas Padjadjaran
Faizal Thamrin DM Innovation
Prof. Dedi Rosadi Universitas Gadjah Mada

Researchers
Group 1 – Estimating Poverty at the Provincial Level with Satellite Data
Benny Istanto World Food Programme
I Wayan Gede Astawa Karang Universitas Udayana
Nursida Arif Universitas Muhammadiyah Gorontalo
Pamungkas Jutta Prahara Pulse Lab Jakarta

Group 2 - Estimating Poverty at the City Level with E-Commerce Data


Ana Uluwiyah Central Statistics Agency
Dedy Rahman Wijaya Telkom University
Dwi Rani Puspa Artha LPEM UI
Ni Luh Putu Satyaning Pradnya Paramita Institut Teknologi Sepuluh Nopember
Anissa Zahara Pulse Lab Jakarta
Muhammad Rheza Pulse Lab Jakarta

Group 3 - Estimating Poverty at the District Level with Social Media Data
Lili Ayu Wulandhari Bina Nusantara University
Sri Redjeki STMIK AKAKOM Yogyakarta
Widaryatmo Central Statistics Agency
Yunita Sari Universitas Gadjah Mada
Muhammad Rizal Khaefi Pulse Lab Jakarta

Group 4 - Estimating Poverty at the Household Level with Social Media Data and Household
Survey Results
Eka Puspitawati Pertamina University
Eko Fadilah TNP2K
Hizkia H. D. Tasik Sam Ratulangi University
Nurlatifah Central Statistics Agency
Rajius Idzalika Pulse Lab Jakarta
Table of Contents

Data Description for AI and Machine Learning for Estimating Poverty

Estimating Poverty at the Provincial Level with Satellite Data

Estimating Poverty at the City Level with E-Commerce Data

Estimating Poverty at the District Level with Social Media Data

Estimating Poverty at the Household Level with Social Media Data and Household Survey Results

Research Dive Artificial Intelligence and Machine Learning for
Estimating Poverty

Zakiya Pramestri Dikara Alkarisya Lia Purnamasari


Pulse Lab Jakarta Pulse Lab Jakarta Pulse Lab Jakarta
Jakarta, Indonesia Jakarta, Indonesia Jakarta, Indonesia
[email protected] [email protected] [email protected]

ABSTRACT
When studying poverty, researchers tend to rely on data related to economic livelihoods, especially to analyse
poverty trends and patterns throughout developing countries. Findings and insights from such analyses are then
typically used to design poverty reduction programmes and formulate public policies. However, collecting data on
economic livelihoods is often a tall order for researchers and governments, in part due to the high costs associated
with it. To address this issue, Pulse Lab Jakarta chose to explore big data - new and diverse digital data sources -
that come with the benefits of increased accuracy, timeliness and cost-efficiency. In this paper, several research
methodologies are discussed, which demonstrate the ability (and advantages) of using these data sources to estimate
consumption expenditure, wealth, and other poverty-related indicators.

To support the Indonesian Government’s agenda on reducing poverty, Pulse Lab Jakarta organised its 7th Research Dive
for Development, focused on exploring artificial intelligence and machine learning approaches for Estimating Poverty.
Among the invited participants were researchers, public officials and academics from across Indonesia who mashed up
various big datasets to develop methods and insights on burning policy questions. The datasets included satellite
imagery, e-commerce data, social media data, and household survey results.

1 INTRODUCTION
Poverty continues to be one of the most pressing issues on the global development agenda. And with global efforts
geared towards achieving Sustainable Development Goal number one on zero poverty, governments continue to invest huge
amounts of resources to reach that goal. Yet, the growing inequality also adds another dimension to the issue. Poverty
brings consequences. Inversely, reducing poverty can improve other crucial aspects, such as public health, economic
development and social cohesion. Even though the number of people living below the global poverty line has gradually
decreased over the last two decades [1], governments around the world are continuing their efforts to make effective
policies to fully eliminate poverty. Because the causes and effects of poverty differ according to contexts, there is
also the need to produce more customized policy to address each unique situation.

Poverty reduction has been a major initiative in Indonesia since the country’s economic crises in 1997 and 1998. For
the central government, poverty reduction has been the main focus of each national medium development plan (RPJMN) [2].
Since the decentralisation era, local governments have also been contributing significantly to poverty reduction
through regional poverty reduction strategies (SPKD) [3]. In addition, many local and international development
stakeholders have been contributing to poverty alleviation in Indonesia. As a result, Indonesia has made enormous
progress by cutting the poverty rate by more than half since 1999, reaching a low of 10.9% in 2016 [4].

To measure poverty, household income and consumption data are critical for researchers and policymakers in order to
design effective and inclusive public policies. However, obtaining such data through traditional methods such as
surveys can be costly. For instance, some developing countries have been conducting similar surveys for decades, yet
have still failed to produce timely data [5]. With the help of rising technologies, researchers across various
disciplines have come up with new approaches and techniques to explore and utilise new datasets generated by modern
technologies [6], commonly referred to as "big data". While these new datasets have indeed shown potential to estimate
poverty, they are not intended to replace conventional datasets. Furthermore, poverty estimation can be more useful
and accurate if researchers can combine big data with traditional data sources, such as household survey data.

One of the Research Dive’s underlying goals is to support the Government’s development agenda; in particular, efforts
geared towards achieving Sustainable Development Goal number one on zero poverty. Pulse Lab Jakarta invited several
participating researchers from universities, government institutions, international organizations, and think tanks
across Indonesia. The various disciplines represented included computer science, economics, geographic information
system (GIS), and statistics. The researchers collaborated with Pulse Lab Jakarta to develop methods and insights on
burning policy questions, specifically: (i) estimating poverty at the provincial level with satellite data, (ii)
estimating poverty at the city level with e-commerce data, (iii) estimating poverty at the district level with social
media data, and (iv) estimating poverty at the household level with social media data and household survey results.

[1] https://www.brookings.edu/blog/future-development/2017/11/07/global-poverty-is-declining-but-not-fast-enough/
[2] A. Suryahadi et al. 2010. Review of the Government’s Poverty Reduction Strategies, Policies, and Programs in
Indonesia. The SMERU Research Institute. Jakarta
[3] National Team for The Acceleration of Poverty Reduction. 2014. Reaching Indonesia’s Poor and Vulnerable and
Reducing Inequality: Improving Programme Targeting, Design, and Process
[4] http://www.worldbank.org/en/country/indonesia/overview
[5] http://www.jblumenstock.com/files/papers/jblumenstock_2016_science.pdf
[6] http://www.jblumenstock.com/files/papers/jblumenstock_2016_science.pdf

2 DATASETS
In this section, we briefly explain the types of data used by the participants during the Research Dive.

2.1 Satellite Imagery Data

2.1.1 Indonesian National Institute of Aeronautics and Space (LAPAN). In partnership with LAPAN, PLJ provided
nighttime satellite data from 2015. The data comes from the Visible Infrared Imaging Radiometer Suite (VIIRS) on the
Suomi National Polar-orbiting Partnership (NPP) satellite and is in .tif format.

2.1.2 Imagery of Asia - Australia. The imagery data is from 2012 and 2016 and can be accessed through the NASA
website. It comes from the Visible Infrared Imaging Radiometer Suite (VIIRS) on the Suomi NPP satellite.

2.1.3 World Imagery. PLJ provided world imagery data for 2010-2013 from the Operational Linescan System (OLS) on
Defense Meteorological Satellite Program (DMSP) spacecraft, version 4.

2.2 E-Commerce Data

In partnership with OLX Indonesia, PLJ provided e-commerce data for Java Island from 2016. Specifically, e-commerce
data was provided on property, cars and motorcycles. Under property, we focused on houses, apartments and land. We
aggregated the data by calculating the average price, average sold price, standard deviation of price and trimmed
price (excluding the lowest and highest 5% of prices), total viewers (number of people who viewed advertisements) and
total contacts (number of people who tried to contact sellers).

2.3 Social Media Data

PLJ shared geo-tagged Twitter data from 2014 with the participants. The data are provided at 30-minute intervals and
are all aggregated at the district level.

2.4 Household Survey Data

PLJ provided data from the 2015 Pemutakhiran Basis Data Terpadu (PBDT, the Unified Database Update) from the National
Team for the Acceleration of Poverty Reduction (TNP2K), including household and individual data. The shared data are
anonymised. PLJ also provided BDT data from the Ministry of Social Affairs for 2015, 2017 and 2018, including
aggregated data at the village level. Additionally, the 2015 Indonesian Family Life Survey from Surveymeter was also
shared. The Indonesian Family Life Survey, whose sample represents about 83% of the Indonesian population, contains
more than 30,000 individuals living in 13 of the 27 provinces in the country. The survey consisted of two
questionnaires: household and community or facility. The household survey includes data on consumption, household
characteristics, education, employment, and health. The community or facility data includes village facilities,
access to social welfare programmes, health facilities, education facilities, and facilities for economic activity.

3 DATA AND TASK MAPPING

We defined four research questions with a different dataset for each question. Participants were allowed to use other
data sources that were not provided by Pulse Lab Jakarta to support and elaborate on their proposed solutions. The
satellite imagery data was given to team one to estimate poverty at the provincial level. The e-commerce data was
shared with the second team to estimate poverty at the city level. The third group was assigned the social media
data: Pulse Lab Jakarta shared Twitter data from 2014 to estimate poverty at the district level. Team four estimated
poverty at the household level and was given access to Twitter data from 2014 and household survey results.

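The per-city aggregation described for the e-commerce data in Section 2.2 could be sketched roughly as follows. This is a minimal illustration, not the procedure PLJ actually ran: the record keys (city, price, sold_price, viewers, contacts) and both function names are hypothetical, and the 5% trim on each tail is one reading of the "trimmed price" description.

```python
from collections import defaultdict
from statistics import mean, stdev


def trimmed_mean(values, trim=0.05):
    """Mean after dropping the lowest and highest `trim` share of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values cut from each tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


def aggregate_by_city(ads):
    """Summarise advertisement records per city.

    Each record is a dict with hypothetical keys:
    city, price, sold_price (None if unsold), viewers, contacts.
    """
    grouped = defaultdict(list)
    for ad in ads:
        grouped[ad["city"]].append(ad)

    summary = {}
    for city, rows in grouped.items():
        prices = [r["price"] for r in rows]
        sold = [r["sold_price"] for r in rows if r["sold_price"] is not None]
        summary[city] = {
            "avg_price": mean(prices),
            "avg_sold_price": mean(sold) if sold else None,
            # Sample standard deviation needs at least two observations.
            "std_price": stdev(prices) if len(prices) > 1 else 0.0,
            "trimmed_price": trimmed_mean(prices),
            "total_viewers": sum(r["viewers"] for r in rows),
            "total_contacts": sum(r["contacts"] for r in rows),
        }
    return summary
```

With few advertisements in a city the trim removes nothing (fewer than 20 records at 5%), so the trimmed price then equals the plain average.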
Figure 1: Nighttime Satellite Data

Table 1: Example of E-Commerce Data

Column name    Column description                                  Example of value
cat_l3_id      Category level 3 id                                 4695
cat_l3_name    Category level 3 name                               Mobil CBU
cat_l2_id      Category level 2 id                                 198
cat_l2_name    Category level 2 name                               Mobil Bekas
cat_l1_id      Category level 1 id                                 86
cat_l1_name    Category level 1 name                               Mobil
cat_group      Category type (Inner / Outer); the Inner            Inner
               category covers everything other than
               services and property

Table 2: Example of tweets per-district

Column Name    Type                              Sample                 Description
TIMESTAMP      Datetime (yyyy-mm-dd HH:MM:SS)    2014-01-01 05:00:00    Timestamp of tweet post activity
PROV           int                               51                     Code of province
KAB            int                               5102                   Code of region
KEC            int                               5102011                Code of district
PROV_NAME      string                            Jawa Barat             Name of province
KAB_NAME       string                            BEKASI                 Name of region
KEC_NAME       string                            BEKASI BARAT           Name of district
LAT            float                             106.47743              Coordinate for latitude
LONG           float                             -6.23672               Coordinate for longitude
GENDER         string                            female                 Gender
SOURCE         string                            Others                 Source of tweet (web/ others)
CONTENT        string                                                   Tweet posted by others

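To illustrate how records with the Table 2 layout might be rolled up to the 30-minute, district-level counts described in Section 2.3, here is a small sketch. It assumes the data arrive as CSV text whose header uses the Table 2 column names; the delivery format is not stated in the report, and the function names are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def half_hour_bucket(ts):
    """Floor a datetime to the start of its 30-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % 30, second=0, microsecond=0)


def tweets_per_district(csv_text):
    """Count tweets per (district code, 30-minute interval).

    Assumes a CSV export whose header uses the Table 2 column names.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        ts = datetime.strptime(row["TIMESTAMP"], "%Y-%m-%d %H:%M:%S")
        counts[(int(row["KEC"]), half_hour_bucket(ts))] += 1
    return counts
```

Keying on the KEC code rather than KEC_NAME avoids merging districts that share a name across provinces.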
Table 3: Example of Household Survey Data

Variable                                                   Column
Number of families                                         jml_keluarga
Number of household members                                jml_art
Housing information
  Self-owned                                               sta_bangunan1
  Owned by others                                          sta_bangunan2
  Building area                                            luas_bangunan (average)
Main floor type
  Marble/granite                                           jns_lantai1
Asset ownership and programme participation
Household and movable assets
  Gas cylinder of 5.5 kg or more                           gas_55
Ownership of immovable assets
  Land                                                     lahan (land area = average)
  House elsewhere                                          rumah_lain
Ownership of smart, welfare, and other cards
  Family Welfare Card (KKS) / Social Protection Card (KPS) kartu_bansos1
  Smart Indonesia Card (KIP) / Poor Student Assistance     kartu_bansos2
  (BSM)

Estimating Poverty at the Provincial Level with Satellite Data
Nursida Arif Pamungkas Jutta Prahara
Universitas Muhammadiyah Gorontalo Pulse Lab Jakarta
Gorontalo, Indonesia Jakarta, Indonesia
[email protected] [email protected]

ABSTRACT
Research suggests that satellite data could be used as a proxy for a number of different parameters, including
urbanisation, density, and economic growth. One of the parameters that could be extracted from satellite images is
land use. In this research, land use classification was done using a Landsat 7 image and a topographic map with the
visual interpretation method. The research took Yogyakarta Province as the case study for its diverse historical
poverty levels and regions. In Kulon Progo, the region with high poverty and vulnerability is generally an area with
a small population and physically vulnerable land, including disaster-prone areas.

KEYWORDS
satellite imagery, land use, poverty line, visual interpretation, vulnerability, poverty indicators, proxy indicators

1 INTRODUCTION
Satellite imagery is increasingly available for free at certain resolutions at global scale and contains a lot of
information at the pixel level that could be associated with economic activity. Much research suggests that satellite
data can be used as a proxy for a number of variables, including urbanization, density, and economic growth. The
environment can be a parameter of poverty, but the complex nature of the population makes it difficult to measure its
influence. Spatially, the effect of the environment on poverty can be assessed with a remote sensing approach.
Although there is no evidence that satellite prediction is better at poverty estimation than conventional censuses,
this approach can be used as support.

One of the parameters that can be extracted from satellite images is land use. Some previous researchers used the
land-use approach as an indicator of poverty [3] [5]. An increase in population will trigger land clearing for various
interests such as settlement, farmland or industry. [6] describes one perspective of land use as a functional space
intended to accommodate diverse uses. In this perspective the land accommodates the growth of the area driven by
population growth and economic expansion. Rural areas have different characteristics from urban areas. According to
Law No. 26 of 2007 and Minister of Public Works Regulation No. 41 of 2007, rural areas are areas whose main activities
are agricultural, including natural resource management, with regional functions arranged as a place of rural
settlements, government services, social services and economic activities.

Urban areas, in contrast, are dominated by non-agricultural activities. [4] explains in more detail that the pattern
of cities can be seen from the existence of built areas such as settlements and infrastructure, while the pattern of
villages is dominated by agricultural land and forest land, with settlement patterns that are small and not
centralized. Rural land is mostly used for mining and agrarian activities, such as agriculture, plantation, animal
husbandry and fisheries. In accordance with the characteristics of its activities, land use in rural areas tends to
use large land units with low intensity of use, which means it tends not to be built land. Therefore, in this
research land use is classified into two (2) classes, i.e. built-up area and non-built-up area, as an indicator of
vulnerability to poverty.

2 METHODS
Land use classification was done using a Landsat 7 image and a topographic map with the visual interpretation method.
Land use is divided into five (5) classes, i.e. forest/garden, rice field, moor/field, open field, and settlement.
Each class is then grouped into two classes, i.e. built and non-built land, as shown in figure 1.

Figure 1: Land Use Class Groups

Yogyakarta Province was chosen for the case study in this research. The rationale is that the historical population
living below the poverty line has diverse conditions (low to high) and regions (urban, rural, coastal).

3 DISCUSSION
Geographically, Yogyakarta is a region with a complete variety of landforms. The northern region is formed of volcanic
soil, the eastern and western regions are plateaus, and the middle and the south are lowlands. Different physiographic
conditions have an impact on population distribution and economic progress, so vulnerability to poverty will be
higher in areas with hilly and mountainous topography. This is evidenced by the land use shown in figure 2.

Figure 2: Yogyakarta Land Use Map

Figure 3: Land Use Maps with Built Up Area

Figure 2 shows that the residential area of Sleman Regency is dominantly distributed in southern Sleman, bordering
Yogyakarta, and in central Sleman. In Kulonprogo District, settlements are dominant in the south and east. In Bantul
District, settlement is dominant in the center, similar to the pattern in Gunung Kidul Regency. The percentage of
settlements or built areas compared to the area of each region, based on the image used, is presented in table 1.

Table 1: Area of each region vs. the area built

Region             Area (km2)    Built up area (%)
Sleman             573.9         27.83
Kulonprogo         581.73        9.79
Yogyakarta City    32.97         84.6
Gunung Kidul       1,476         13.01
Bantul             514.3         21.6
Analysis, 2018

Table 1 shows that the city of Yogyakarta has a greater percentage of settlements / built areas than the other
regions because, as the capital of the Province, it is the center of the economy. The areas with the smallest
percentage of built-up area are Kulonprogo (9.79%) and Gunung Kidul (13.01%). These results indicate that Kulonprogo
and Gunung Kidul are more vulnerable to poverty than the other regions. Based on this, further analysis focuses on
Kulon Progo Regency to test the correlation of image interpretation with the condition of the population. Land use
maps with built-up areas are shown in figure 3.
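The comparison drawn from Table 1 can be reproduced with a few lines of code. The sketch below is illustrative only: the percentages are copied from Table 1, while the mapping of the five land-use classes to the two groups is an assumption (Figure 1 is not reproduced here, and only "settlement" is treated as built), and the function names are hypothetical.

```python
# Built-up share (%) per region, copied from Table 1.
BUILT_UP_PCT = {
    "Sleman": 27.83,
    "Kulonprogo": 9.79,
    "Yogyakarta City": 84.6,
    "Gunung Kidul": 13.01,
    "Bantul": 21.6,
}

# Assumed grouping of the five land-use classes into two groups;
# treating only "settlement" as built is an assumption, not the authors' table.
LAND_USE_GROUP = {
    "forest/garden": "non-built",
    "rice field": "non-built",
    "moor/field": "non-built",
    "open field": "non-built",
    "settlement": "built",
}


def built_up_share(class_areas_km2):
    """Percentage of total area falling in classes grouped as 'built'."""
    total = sum(class_areas_km2.values())
    built = sum(
        area for cls, area in class_areas_km2.items()
        if LAND_USE_GROUP[cls] == "built"
    )
    return 100.0 * built / total


def least_built_up(pct_by_region, n=2):
    """Names of the n regions with the smallest built-up share."""
    return sorted(pct_by_region, key=pct_by_region.get)[:n]


print(least_built_up(BUILT_UP_PCT))  # ['Kulonprogo', 'Gunung Kidul']
```

Given per-class areas for a region, `built_up_share` would yield the kind of percentage reported in the table's last column; the ranking step then singles out the two regions the text identifies as more vulnerable.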
Of the 12 sub-districts in Kulonprogo Regency listed in Table 2, those with the smallest percentage of built-up area
are Lendah, Panjatan, Kokap, Galur, Temon and Sentolo, while the sub-districts with large populations include
Pengasih, Sentolo and Wates. Graphs of population and built-up areas are shown in Figure 5.

Figure 4: Land use map with built area in Kulonprogo

Table 2: Area of each region vs. built area

Sub-district    Area (km2)    Built up area (km2)    Built up area (%)    Population
Girimulyo       58.31         7.67                   13.16                12,417
Temon           38.86         2.50                   6.43                 13,452
Galur           31.51         1.26                   4.00                 16,182
Samigaluh       65.76         8.36                   12.72                16,729
Kalibawang      48.87         8.33                   17.05                17,277
Nanggulan       38.86         8.71                   22.42                17,360
Panjatan        43.95         1.22                   2.77                 19,020
Kokap           72.42         2.60                   3.59                 19,430
Wates           32.43         4.88                   15.04                19,582
Lendah          36.94         0.77                   2.09                 22,544
Pengasih        60.55         6.07                   10.02                25,528
Sentolo         53.27         4.20                   7.89                 27,812
Analysis, 2018

Figure 5: Graphs of population and built-up areas in Kulonprogo

Based on Figure 5, the results of the analysis show that the areas vulnerable to poverty are the sub-districts of
Galur, Kokap and Panjatan, where the built-up area is small and the population is small compared to other
sub-districts, while Pengasih, Wates and Sentolo have a smaller degree of vulnerability. In Kokap, one of the
sub-districts vulnerable to poverty, this vulnerability is reinforced by the physical condition of the land, which is
vulnerable to disaster. [2] assessed landslides in some areas of Kokap Sub-district and explained that the actual
landslide process occurs in the denudational hills. Similarly, [1] places Panjatan and Kokap sub-districts in the
class of very serious erosion susceptibility.

4 CONCLUSIONS
Based on the land use analysis, the poverty-prone areas in the Special Region of Yogyakarta are Kulon Progo and
Gunung Kidul, since their built-up areas are smaller than those of other regions. In Kulon Progo, the region with high
poverty and vulnerability is generally an area with a small population and physically vulnerable land, including
disaster-prone areas such as Kecamatan Kokap.

5 LIMITATION AND RECOMMENDATION

This study uses medium-resolution imagery with analysis at the Provincial level, so land use classification cannot
be done in more detail. Poverty is too complex to be assessed based on land use alone. Poverty estimation using
satellite imagery should be applied to smaller areas, at sub-district and village levels, with high-resolution
imagery, so that a more detailed analysis can show the type of roof, the type of settlement patterns, access roads
and other attributes that support the identification of poverty levels. Future research can estimate poverty through
the integration of remote sensing images with other parameters such as hazard vulnerability and other social factors
such as income and nutrient intake per household.

REFERENCES
[1] Nursida Arif, Projo Danoedoro, and Hartono Hartono. 2017. Remote Sensing and GIS Approaches to A Qualitative
Assessment of Soil Erosion Risk in Serang Watershed, Kulonprogo, Indonesia. Geoplanning: Journal of Geomatics and
Planning 4, 2 (2017), 131–142.
[2] Suprapto Dibyosaputro. 1992. Longsorlahan di Daerah Kecamatan Samigaluh, Kabupaten Kulon Progo, Daerah Istimewa
Yogyakarta. Technical Report. Yogyakarta.
[3] Gilvan R Guedes, Leah K VanWey, James R Hull, Mariangela Antigo, and Alisson F Barbieri. 2014. Poverty dynamics,
ecological endowments, and land use among smallholders in the Brazilian Amazon. Social science research 43 (2014),
74–91.
[4] Fransiscus Xaferius Herwirawan, Cecep Kusmana, Endang Suhendang, and Widiatmaka Widiatmaka. 2017. Changes in Land
Use/Land Cover Patterns in Indonesia’s Border and their Relation to Population and Poverty. Jurnal Manajemen Hutan
Tropika 23, 2 (2017), 90–101.
[5] Eddie CM Hui, Jiawei Zhong, and Kahung Yu. 2016. Land use, housing preferences and income poverty: In the context
of a fast rising market. Land Use Policy 58 (2016), 289–301.
[6] Edward J Kaiser, David R Godschalk, and F Stuart Chapin. 1995. Urban land use planning. Vol. 4. University of
Illinois Press Urbana, IL.

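As an illustration of the Kulonprogo sub-district screening discussed in Section 3 of the satellite paper above, the following sketch recomputes the built-up percentage from the Table 2 areas and flags sub-districts that combine a small built-up share with a small population. The two cut-offs (5% built-up, 20,000 inhabitants) are assumptions chosen for this example; the paper reads the pattern off Figure 5 and does not state numeric thresholds. The function names are hypothetical.

```python
# (area km2, built-up km2, population) per sub-district, copied from Table 2.
KULONPROGO = {
    "Girimulyo": (58.31, 7.67, 12417),
    "Temon": (38.86, 2.50, 13452),
    "Galur": (31.51, 1.26, 16182),
    "Samigaluh": (65.76, 8.36, 16729),
    "Kalibawang": (48.87, 8.33, 17277),
    "Nanggulan": (38.86, 8.71, 17360),
    "Panjatan": (43.95, 1.22, 19020),
    "Kokap": (72.42, 2.60, 19430),
    "Wates": (32.43, 4.88, 19582),
    "Lendah": (36.94, 0.77, 22544),
    "Pengasih": (60.55, 6.07, 25528),
    "Sentolo": (53.27, 4.20, 27812),
}


def built_up_pct(area_km2, built_km2):
    """Built-up area as a percentage of total area, to two decimals."""
    return round(100.0 * built_km2 / area_km2, 2)


def flag_vulnerable(data, max_pct=5.0, max_pop=20000):
    """Sub-districts below both illustrative thresholds (assumed values)."""
    return sorted(
        name
        for name, (area, built, population) in data.items()
        if built_up_pct(area, built) < max_pct and population < max_pop
    )


print(flag_vulnerable(KULONPROGO))  # ['Galur', 'Kokap', 'Panjatan']
```

Under these assumed cut-offs the flagged set matches the three sub-districts named in the discussion; Lendah has the smallest built-up share but is excluded by its larger population.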
Estimating City-Level Poverty Rate based on e-Commerce Data
with Machine Learning
Dedy Rahman Wijaya, Telkom University and Institut Teknologi Sepuluh Nopember
Ni Luh Putu Satyaning Pradnya Paramita, Institut Teknologi Sepuluh Nopember
Ana Uluwiyah, Education and Training Center, Statistics Indonesia
Dwi Rani Puspita, Institute for Economic and Social Research, Universitas Indonesia
Muhammad Rheza, Pulse Lab Jakarta, Jakarta Pusat, Jakarta
Annisa Zahara, Pulse Lab Jakarta, Jakarta Pusat, Jakarta

ABSTRACT
Indonesia abundantly produces big data from various sources, e.g. social media, financial transactions, transportation, call detail records and e-commerce. These types of data have been considered as potential resources to complement periodic surveys, and even censuses, in monitoring development indicators, among which is the poverty rate. This research aims to estimate the poverty rate at city level based on e-commerce data using machine learning methods, i.e. Support Vector Regression (SVR) and Artificial Neural Network (ANN). Feature selection has been performed with the Fast Correlation-Based Filter (FCBF). The result shows that the ANN-based model predicts the city-level poverty rate very well, with high accuracy, low error and low bias. This research suggests that e-commerce data has the potential to be used as a proxy for the city-level poverty rate.

KEYWORDS
TBC

1 INTRODUCTION
The poverty rate is the ratio of the number of people whose income falls below the poverty line to the total population, where the poverty line is taken as half the median household income of the total population [1]. In Indonesia, the poverty rate is produced yearly by Statistics Indonesia by conducting a National Social and Economic Survey [4]. Despite all the benefits of conducting this survey regularly every year, there are a couple of limitations, such as (i) the inability to gather information on the poverty rate in between the surveys and (ii) the requirement of certain resources in order to conduct the surveys.
Along with that, Indonesia abundantly produces big data from various sources, e.g. social media, financial transactions, transportation, call detail records, e-commerce, etc. These types of data have been considered as potential resources to complement periodic surveys, and even censuses, in monitoring development indicators, among which is the poverty rate. As a dimension of economics, the poverty rate could be highly correlated with data related to consumption and purchasing power, which can potentially be represented by e-commerce data. According to [3], revenue in the e-commerce market in Indonesia amounts to USD 9,138m in 2018, with user penetration at 40% in 2018 and expected to hit 48.3% in 2022.
This research aims to estimate the poverty rate at city level based on e-commerce data using machine learning methods, i.e. Support Vector Regression (SVR) and Artificial Neural Network (ANN). Considering the representation of e-commerce in certain areas in Indonesia, the scope of this research is 118 cities in Java island.

2 DATASET
This study utilizes two main data sources that complement each other: e-commerce data and official poverty data. The main dataset used in this research is the advertisements of goods posted on one of the big e-commerce platforms in Indonesia, OLX. The following goods are included in the analysis.
(1) Car
(2) Motorbike
(3) House to sell
(4) House to rent
(5) Apartment to sell
(6) Apartment to rent
(7) Land to sell
(8) Land to rent
For each of those goods, the number of items sold, the price sold, the number of viewers, and the number of buyers were extracted. Then, each of these four pieces of information was aggregated by city, per type of goods, into three statistical measurements, i.e. sum, average, and standard deviation, to capture both the central tendency and the variation of the data. In total, 96 initial features (8 goods × 4 pieces of information × 3 statistics) were extracted for this research.
As the ground truth, the poverty rate at city level published by Statistics Indonesia is used (see Figure 1). For both e-commerce and official data, the data from 2016 has been used for this research.

Table 1: Train-and-test splitting procedure

                    Split 1   Split 2
Odd observation     Train     Test
Even observation    Test      Train

3 METHODOLOGY

3.1 Pre-processing
Given 96 features and cities i = 1, ..., 118, normalisation has been done for each feature with the following formula:
Figure 1: Poverty rate (%) at city-level, 2016 [5]

$$ z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{1} $$

where z is the normalised value of the respective feature, x is its original value, and min(x) and max(x) are the minimum and maximum values of that feature. To perform the analysis, the data of 118 cities has been split into training and testing data with the procedure shown in Table 1.

3.2 Feature Selection
In this research, feature selection is done with the Fast Correlation-Based Filter (FCBF). FCBF addresses two questions: (i) whether a feature is relevant to the class or not and (ii) whether such a relevant feature is redundant, i.e. correlated with other features, or not.
The main principle of the FCBF algorithm is to maximize C-Correlation and minimize F-Correlation. Maximizing C-Correlation aims to find out which features are most closely related to the class label, which means they make it possible to predict the class label well, while minimizing F-Correlation means reducing the number of redundant features/predictors. The values of C-Correlation and F-Correlation are measured by Symmetrical Uncertainty (SU), which captures non-linear correlations between two discrete random variables V and W. SU can be expressed as follows:

$$ SU(V, W) = 2\left[\frac{IG(V \mid W)}{H(V) + H(W)}\right] \tag{2} $$

where

$$ IG(V \mid W) = H(V) - H(V \mid W) \tag{3} $$

$$ H(V) = -\sum_{i} P(v_i)\log_2 P(v_i) \tag{4} $$

$$ H(W) = -\sum_{i} P(w_i)\log_2 P(w_i) \tag{5} $$

$$ H(V \mid W) = -\sum_{j} P(w_j)\sum_{i} P(v_i \mid w_j)\log_2 P(v_i \mid w_j) \tag{6} $$

where P(v_i) is the prior probability of each value of V and P(v_i | w_j) is the posterior probability of V given the value of W. SU values range between 0 and 1. The higher the SU value, the higher the correlation between the two variables [7]. In this experiment,
we use an SU threshold of 0 so that features with a small SU still have the opportunity to be selected.
With FCBF, 29 features are selected from the 96 initial features.

3.3 Support Vector Regression (SVR)
SVR has the same concept as the Support Vector Machine (SVM). Consider a training set {(X_1, Z_1), ..., (X_l, Z_l)}, where X_i and Z_i are the feature vector and the target output, respectively. In ε-SVR, the primal optimization problem can be expressed as follows [6]:

$$ \min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2} w^T w + C\sum_{i=1}^{l}\xi_i + C\sum_{i=1}^{l}\xi_i^* \tag{7} $$

$$ \text{subject to}\quad w^T\phi(X_i) + b - Z_i \le \varepsilon + \xi_i, \quad Z_i - w^T\phi(X_i) - b \le \varepsilon + \xi_i^*, \quad \xi_i,\ \xi_i^* \ge 0,\ i = 1, \ldots, l $$

where w, C, ξ, ε and b denote the slope, the regularization parameter, the slack variable for the soft margin, the margin of tolerance, and the intercept/bias, respectively. The symbol φ(X_i) indicates the mapping of X_i into a higher-dimensional space. The dual optimization problem is given by

$$ \min_{\alpha,\,\alpha^*} \; \frac{1}{2}(\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + \varepsilon\sum_{i=1}^{l}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} Z_i(\alpha_i - \alpha_i^*) \tag{8} $$

$$ \text{subject to}\quad e^T(\alpha - \alpha^*) = 0, \quad 0 \le \alpha_i,\ \alpha_i^* \le C,\ i = 1, \ldots, l $$

where α and α* denote the Lagrangian multipliers, Q_{ij} = K(X_i, X_j) ≡ φ(X_i)^T φ(X_j), and e = [1, ..., 1]^T. In linear SVR, the decision function is expressed by

$$ Y = \sum_{i=1}^{l}(-\alpha_i + \alpha_i^*)\langle X_i, X\rangle + b \tag{9} $$

In non-linear SVR, a kernel function, e.g. RBF, transforms the input data into a higher-dimensional feature space in which the linear model is fitted. The decision function is computed by

$$ Y = \sum_{i=1}^{l}(-\alpha_i + \alpha_i^*)\langle \phi(X_i), \phi(X)\rangle + b \tag{10} $$

$$ Y = \sum_{i=1}^{l}(-\alpha_i + \alpha_i^*) K(X_i, X) + b \tag{11} $$

The RBF kernel is used to deal with non-linear data and can be computed by the following equation.

$$ K(X_i, X) = \exp\left(-\gamma \lVert X_i - X \rVert^2\right) \tag{12} $$

In this experiment, grid search is performed to determine the two SVR hyperparameters, with candidate values [0.01, 0.1, 1, 10, 100, 1000] for the first and [0.01, 0.1, 1, 10, 100] for the second.

3.4 Artificial Neural Network (ANN)
The ANN contains three layers: an input, a hidden, and an output layer. These layers are built from several neurons that convert the signals based on the connection weight, bias, and activation function. Figure 2 shows the neural network architecture constructed for this research. The input layer contains 29 neurons corresponding to the input features. The hidden layer has 200 neurons with the tanh activation function. Finally, the output layer has only one neuron to accommodate the continuous outputs of a regression task.

Figure 2: Architecture of ANN

3.5 Performance Measurement
The performance of the models is measured by three metrics, i.e. root mean squared error (RMSE), bias factor, and accuracy factor. The equation and description of each measurement are detailed in Table 2.

4 RESULT AND CONCLUSION
The following three models have been built in this research.
(1) SVR-based model with 96 features
(2) SVR-based model with 29 features selected through the FCBF procedure
(3) ANN-based model with 29 features selected through the FCBF procedure
As discussed in the methodology section, the models' performance is assessed by three metrics, i.e. RMSE, accuracy factor, and bias factor. The performance of the three models is shown in Table 3. The RMSE of the SVR-based model with FCBF (4.3363) is only slightly lower than that of the SVR-based model without feature selection (4.9037), showing that, with or without feature selection, the SVR-based model produces a similar error in estimating the poverty rate at city level.
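The evaluation procedure behind these numbers (min-max normalisation from Section 3.1 followed by the odd/even two-fold split of Table 1) can be sketched roughly as follows. This is only an illustration under stated assumptions: `fit_mean_model` is a hypothetical placeholder standing in for SVR or ANN, and pooling the squared errors of both splits into one RMSE is an assumption, since the report does not say how the two splits are combined.

```python
import math


def min_max_normalise(column):
    """Scale one feature column to [0, 1], following equation (1)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: avoid division by zero (handling is an assumption).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def odd_even_splits(n):
    """Yield (train_idx, test_idx) for the two splits of Table 1.

    Observations are numbered from 1, so the odd observations sit at
    list positions 0, 2, 4, ...
    """
    odd = [i for i in range(n) if (i + 1) % 2 == 1]
    even = [i for i in range(n) if (i + 1) % 2 == 0]
    yield odd, even   # Split 1: train on odd, test on even
    yield even, odd   # Split 2: train on even, test on odd


def fit_mean_model(X_train, y_train):
    """Hypothetical placeholder model: always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda features: mean


def two_fold_rmse(X, y, fit=fit_mean_model):
    """RMSE pooled over both splits (pooling is an assumption)."""
    squared_errors = []
    for train_idx, test_idx in odd_even_splits(len(y)):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            squared_errors.append((y[i] - model(X[i])) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Any regressor exposing the same `fit(X_train, y_train)` shape could be passed in place of the placeholder to compare models under the same splits.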
Table 2: Performance measurements

Metric: RMSE
Equation:
$$ RMSE(y, \hat{y}) = \sqrt{\frac{\sum_{i=1}^{L}(y_i - \hat{y}_i)^2}{L}} $$
Description: RMSE is used to measure the difference/error between the actual and prediction vectors. A lower RMSE value means a smaller difference between actual and predicted values.

Metric: B_f
Equation:
$$ B_f(y, \hat{y}) = \exp\left(\frac{\sum_{i=1}^{L}\left(\ln(\hat{y}_i) - \ln(y_i)\right)}{L}\right) $$
Description: The bias factor denotes whether the predictions under- or over-estimate the actual values. An unbiased prediction is indicated by B_f equal to 1. B_f < 1 means that the prediction result is lower than the actual values (underestimate) and vice versa [2].

Metric: A_f
Equation:
$$ A_f(y, \hat{y}) = \exp\left(\sqrt{\frac{\sum_{i=1}^{L}\left(\ln(\hat{y}_i) - \ln(y_i)\right)^2}{L}}\right) $$
Description: The accuracy factor measures the average accuracy of the prediction model. The value of A_f is equal to or greater than one. A value larger than one indicates less accurate prediction results [2].
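As a minimal sketch, the three measurements can be computed with the Python standard library as shown below. The function names are illustrative, and the sign convention for the bias factor (predicted over actual) is chosen so that a value below 1 means underestimation, as the description states. Both factors assume strictly positive values, since they take logarithms.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def bias_factor(actual, predicted):
    """Below 1: predictions are on average lower than the actual values."""
    n = len(actual)
    log_ratios = [math.log(p) - math.log(a) for a, p in zip(actual, predicted)]
    return math.exp(sum(log_ratios) / n)


def accuracy_factor(actual, predicted):
    """Always >= 1; the further above 1, the less accurate the predictions."""
    n = len(actual)
    squared = [(math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.exp(math.sqrt(sum(squared) / n))
```

For example, predictions that are exactly half of every actual value give a bias factor of 0.5 and an accuracy factor of 2, while perfect predictions give 1 for both.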

Meanwhile, the ANN-based model with FCBF produces a much smaller RMSE, i.e. 0.2725, than the two preceding SVR-based models. This number indicates that estimating the poverty rate at city level with ANN-FCBF gives very good accuracy, since the RMSE value is close to zero. Although all three models predict the poverty rate lower than the actual one (underestimate), as indicated by a bias factor of less than 1, the bias factor for the ANN-FCBF model is only slightly less than 1 (0.9981). Moreover, the ANN-FCBF model gives an almost accurate prediction, since the value of the accuracy factor is very close to 1 (1.0007).
Figure 3 and Figure 4 show the predicted poverty rate based on SVR-FCBF and ANN-FCBF, respectively, compared with the actual poverty rate. The actual poverty rate is sorted for better visualization. The prediction of the poverty rate produced by ANN-FCBF follows the actual poverty rate, with around one-third of the city-level poverty rates predicted exactly the same as the actual ones. For cities with a high actual poverty rate, the poverty rate is relatively difficult to predict with the SVR-FCBF model.

Table 3: Performance of the three models

Measurement        SVR      SVR-FCBF   ANN-FCBF
RMSE               4.9037   4.3363     0.2725
Accuracy factor    1.2853   1.1772     1.0007
Bias factor        0.9277   0.9883     0.9981

If the city-level poverty rate is represented by ranges, as shown by the maps in Figure 5, Figure 6, and Figure 7, the ANN-FCBF model predicts the category of poverty rate exactly the same as the category of the actual poverty rate. Based on the results discussed above, this research concludes and suggests that:
(1) e-commerce data has the potential to be used as a proxy for the city-level poverty rate,
(2) e-commerce data can be used to complement official data to estimate the poverty rate in between surveys and censuses,
(3) in the future, the method presented in this research has the potential to be replicated and scaled up for all cities in Indonesia or other administrative levels (i.e. province and sub-district), time series or panel data, and data from different e-commerce platforms.

REFERENCES
[1] Organisation for Economic Co-operation and Development. 2018. Poverty Rate. (2018). https://data.oecd.org/inequality/poverty-rate.htm.
[2] C. P. T. R. Baranyi J. 1999. Validating and comparing predictive model. Journal of Food Microbiology 3 (1999).
[3] Statista. [n. d.]. ([n. d.]). https://www.statista.com/outlook/243/120/ecommerce/indonesia
[4] Badan Pusat Statistik. [n. d.]. ([n. d.]). https://microdata.bps.go.id/mikrodata/index.php/catalog/SUSENAS/about
[5] Badan Pusat Statistik. [n. d.]. Persentase Penduduk Miskin Menurut Kabupaten/Kota, 2015 - 2017. ([n. d.]). https://www.bps.go.id/dynamictable/2017/08/03/1261/persentase-penduduk-miskin-menurut-kabupaten-kota-2015---2017.html
[6] V.N. Vapnik. 1998. Statistical Learning Theory.
[7] L. Yu and H. Liu. 2003. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. Twentieth International Conference on Machine Learning (ICML) (2003).
Figure 3: Predicted city-level poverty rate based on SVR-FCBF

Figure 4: Predicted city-level poverty rate based on ANN-FCBF

Figure 5: Actual city-level poverty rate map

Figure 6: Predicted city-level poverty rate based on SVR-FCBF map

Figure 7: Predicted city-level poverty rate based on ANN-FCBF map

Estimating Poverty at the District Level with Social Media
Extended Abstract
Lili Ayu Wulandhari, Bina Nusantara University, Jakarta, Indonesia
Sri Redjeki, STMIK AKAKOM Yogyakarta, Yogyakarta, Indonesia
Yunita Sari, Universitas Gadjah Mada, Yogyakarta, Indonesia
Widaryatmo, Bappenas, Jakarta, Indonesia
M. Rizal Khaefi, Pulse Lab Jakarta, Jakarta, Indonesia, muhammad.khaefi@un.or.id

ABSTRACT
The rise of social media has encouraged data-driven research in many disciplines. Previous studies have utilised Twitter, a popular micro-blogging service, as a valuable information resource for various applications including demographic identification and event detection. This study explores how information from Twitter can be used as a leading poverty indicator that supports data obtained from survey-based methods. We define an array of poverty-related keywords based on the Socio Economic Survey (SUSENAS) conducted by the National Statistics Agency (BPS) and apply a semi-supervised machine learning algorithm called Pseudo Labeling to identify tweets with a poverty indicator. Results of our experiments on a subset of Twitter data in the JABODETABEK area show that our pre-defined keywords are effective, obtaining 65% accuracy. This study acts as a preliminary benchmark for further work on identifying poverty from social media, especially in Bahasa Indonesia.

KEYWORDS
Poverty indicator, Twitter, Semi-supervised Learning

1 INTRODUCTION
According to a survey conducted by the Indonesian Internet Service Providers Association (APJII) in 2017, Indonesia had 143.26 million internet users, which covered nearly 54.68% of the total population of the country [2]. This placed Indonesia in the top five countries with the most internet users, after China, India, USA, and Brazil. The survey revealed that 87.13% of the users utilised the internet to access social media. APJII reported that Twitter, a popular micro-blogging service, is one of the most visited social media [1]. Previous studies have indicated that user demographic profiles such as age, gender, education background, income, and occupation can be predicted from their posts on Twitter [6, 8, 11, 13]. In addition, Twitter has been widely used as a valuable information resource for various applications including earthquake detection [14], stock market prediction [5] [4], crime prediction [9] and analysis of political power distribution in elections [7]. Among the numerous potential applications, this study addresses the issue of identifying poverty indicators from Twitter, which offers advantages in cost and timeliness over traditional survey-based methods. Currently, poverty-related data (i.e. amount of income, source of income, number of dependent family members, and living condition) in Indonesia is obtained through a survey-based method conducted by the National Statistics Agency (BPS). This method may provide precise results, albeit at a high cost. Compared to the survey-based method, Twitter provides a huge data volume which can be obtained at no cost. In addition, Twitter enables real-time and direct surveillance, which makes it useful for detecting economic dynamics in society.
Nevertheless, extracting poverty-related information from Twitter is challenging. Thus, the aim of this study is not to replace the survey-based method as the main source of poverty data, but to provide additional information to support the survey-based data. This study explores how information from Twitter content can be used as a leading poverty indicator. Experiments are carried out using a subset of Twitter data in the JABODETABEK (Jakarta Bogor Depok Tangerang Bekasi) area within the year 2014. We first define an array of poverty-related keywords based on data from BPS. We then filter tweets that contain those keywords and perform human annotation to assign poor and non-poor labels to the selected tweets. We assume that Twitter users belong to the middle-upper class of society, with good financial condition. Thus, the labels assigned do not represent the poverty level of an individual user but rather indicate the poverty condition in the area where the users reside or post their tweets. Labelling tweets to generate training and test datasets is a labour-intensive process. Thus, we apply Pseudo-Labeling [10], a semi-supervised machine learning algorithm, to classify tweets into the pre-defined classes. In this way, the model learns not only from the labeled data but also from the large amount of unlabeled data. The experiment results show that the pre-defined poverty-related keywords are effective for identifying tweets with a poverty indicator. Results from this study act as a benchmark for further work on identifying poverty from social media, especially in Bahasa Indonesia.
The rest of the paper is organised as follows: in Section 2 we describe related work which uses Twitter as the main source of data. Section 3 describes our methodology to identify poverty indicators on Twitter, including the details of the data (Section 3.1) and how the human annotation is conducted (Section 3.2). Results and analysis are presented in Section 4. Finally, we summarize our findings and describe possible work to be explored further in Section 5.

2 RELATED WORK
The vast availability of social media data has encouraged data-driven research in many disciplines. Twitter is one of the social media
Figure 1: Flowchart of Research Methodology

which serves as a potential resource for many applications. Sakaki et al. [14] investigated the real-time interaction of events such as earthquakes on Twitter and proposed an algorithm to monitor tweets and to detect a target event. They used a Support Vector Machine (SVM) and three feature groups for event detection and applied Kalman and particle filters for location estimation. By considering each Twitter user as a sensor, Sakaki et al. constructed a reporting system which detects earthquakes promptly and sends notification e-mails to registered users. They reported that their system is able to deliver notifications faster than the announcements broadcast by the Japan Meteorological Agency (JMA).
Another work by Aramaki et al. [3] addressed the issue of detecting influenza epidemics using Twitter. They constructed an influenza corpus consisting of 0.4 million tweets, divided into training and testing parts. Using two pre-defined conditions, human annotation was conducted to assign positive and negative labels to the tweets. Several machine learning classifiers with Bag Of Words (BOW) features were used to identify whether a given tweet is positive or negative. Their experiment results outperformed the state-of-the-art Google method by obtaining a high correlation (correlation ratio=0.89).
Previous work has also tried to predict demographic attributes from Twitter, which are useful for marketing, personalization, and legal investigation. A work by Burger et al. [6] constructed a large, multilingual dataset labeled with gender and explored several statistical approaches for identifying the gender of Twitter users. In order to assign gender labels to the Twitter accounts, the authors sampled the corresponding user profiles, which were obtained by following the Twitter URL links to several of the most represented blog sites in their dataset. Burger et al. used the content of the tweets and three user profile fields, namely full name, screen name, and description, to discriminate the gender of the users. Their approaches obtained a best accuracy of 92%. In addition, by using only the content of the tweets, the model reached 76% accuracy.
Similar to Burger et al., Flekova et al. [8] explored stylistic variation with age and income on Twitter. By using a variety of features such as word and character lengths, readability measures (i.e. the Automatic Readability Index, the Flesch Kincaid Grade Level, the Flesch Reading Ease), part-of-speech (POS), and contextuality measures combined with linear regression, they successfully predicted age and income groups of Twitter users. One interesting finding is that Flesch Reading Ease, previously reported to correlate with education levels at a community level, is highly indicative of income. In addition, the increased use of nouns, determiners and adjectives is correlated more highly with age than with income.

3 METHODOLOGY
This research is conducted in four steps, namely data preparation, data preprocessing and annotation, pseudo labeling, and evaluation. Data preparation aims to comprehend and analyze the information contained in the data. This step provides the justification for which techniques are chosen for preprocessing. The preprocessing step is conducted to extract keywords, as one of the leading indicators, from the raw data. The result of the preprocessing step is used in the pseudo labeling algorithm. A detailed explanation of the data and of pseudo labeling is presented in subsection 3.1 and subsection 3.3.
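As a rough illustration of the keyword-extraction step, the frequency of each pre-defined keyword in a tweet could be counted as below. The keyword list is the one given in Section 3.3. The counting rule (case-insensitive substring counting, so that "beras" is also counted inside "harga beras" and "#berasnaik") is an assumption, since the paper does not describe how overlapping keywords are handled.

```python
# The fourteen pre-defined keywords listed in Section 3.3.
KEYWORDS = [
    "beras", "sembako", "miskin", "paket sembako", "beras murah",
    "harga beras", "beras naik", "harga naik", "antri sembako",
    "hidup susah", "#berasnaik", "#bbmnaik", "#sembako", "#berasmahal",
]


def keyword_frequencies(tweet_text):
    """Return one count per keyword, in the order of KEYWORDS.

    Assumption: plain case-insensitive substring counting, so shorter
    keywords are also counted when they occur inside longer ones.
    """
    lowered = tweet_text.lower()
    return [lowered.count(keyword) for keyword in KEYWORDS]


example = "Harga beras naik lagi, hidup susah #berasnaik"
features = keyword_frequencies(example)
```

Each tweet then becomes a vector of fourteen counts that can be combined with the numeric metadata fields before classification.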
Table 1: Example of Twitter Data

Record 1
LAT: -624.022
LON: 106.980.765
PROV: 32
PROV_NAME: Jawa Barat
KAB: 3275
KAB_NAME: BEKASI
KEC: 3275050
KEC_NAME: BEKASI SELATAN
GENDER: Female
CONTENT: Apapn statusnya miskin atupun kaya tua muda sehat sakit sibuk lelah kewajiban seorang hamba harus tetap dilaksanakan.
TIMESTAMP: 1409504693
SOURCE: Twitter for Android

Record 2
LAT: -6.168.961
LON: 106.826.796
PROV: 31
PROV_NAME: Daerah Khusus Ibukota Jakarta
KAB: 3173
KAB_NAME: JAKARTA PUSAT
KEC: 3173080
KEC_NAME: GAMBIR
GENDER: Male
CONTENT: Orang miskin jangan melawan orang kaya dan orang kaya jangan melawan pejabat.Kalau mau mengubah bangsa ini jadilah PEJABAT.
TIMESTAMP: 1409738003
SOURCE: Twitter Web Client

Record 3
LAT: -6.194.607
LON: 106.675.791
PROV: 36
PROV_NAME: Banten
KAB: 3671
KAB_NAME: TANGERANG
KEC: 3671020
KEC_NAME: CIPONDOH
GENDER: Male
CONTENT: gua kan jelek miskin bego ga akan selamanya gua begitu
TIMESTAMP: 1410420787
SOURCE: Twitter for Android

3.1 Dataset Details
In this research we analyze Twitter data for the JABODETABEK area for twelve months in 2014. From these twelve months, we exclude January, June, July and December due to special occasions, such as new year and religious holidays, that happened in those months. Each tweet contains the following information:
• TIMESTAMP : Timestamp of tweet post activity
• PROV : Code of province
• KAB : Code of region
• KEC : Code of district
• PROV_NAME : Name of province
• KAB_NAME : Name of region
• KEC_NAME : Name of district
• LAT : Coordinate for latitude
• LONG : Coordinate for longitude
• GENDER : Gender of Twitter user
• SOURCE : Source of tweet (web/others)
• CONTENT : Tweet posted by user
Samples of the Twitter data are presented in Table 1. In addition, we perform an analysis of the Twitter users' behaviour. Based on data profiling, we find that most users post tweets on the weekend, with the highest peak on Saturday followed by Sunday and Friday (see Figure 2). Within a day, Twitter users tend to post between 9.00 am and 4.00 pm around the Jakarta Selatan and Jakarta Pusat area (see Figure 3). Moreover, the analysis shows that Twitter is used actively around business centers, downtown and entertainment center areas, which are spread across Jakarta Selatan and Jakarta Pusat. This is in line with the number of poverty-related tweets identified from these areas.

Figure 2: Twitter users behavior in a week within JABODETABEK area

From this data, poverty indicator keywords are extracted based on the Socio Economic National Survey (SUSENAS) in 2018 (see Table 2). This survey shows that things related to food, such as "beras" (rice), dominate society's needs. Therefore, we extract food- and poverty-related words as the keywords. Overall, fourteen keywords and the tweets' metadata are used as the input for identifying the poverty indicator. Each non-numeric feature, i.e. PROV_NAME, KAB_NAME and GENDER, is transformed into numeric form with the following definition:
PROV_NAME:
• Daerah Khusus Ibukota Jakarta: 31
• Jawa Barat: 32
• Banten: 36
KAB_NAME:
• Jakarta Selatan: 1
• Jakarta Timur: 2
• Jakarta Pusat: 3
• Jakarta Barat: 4
• Jakarta Utara: 5
• Kepulauan Seribu: 6
• Bogor: 7
• Depok: 8
• Bekasi: 9
• Tangerang: 10
• Tangerang Selatan: 11
GENDER:
• Male : 1
• Female: 2
This data is annotated manually into poor and non-poor classes to be the input data for the pseudo labeling approach.

Figure 3: Twitter users behavior in a day within JABODETABEK area

Table 2: List of commodities contribute to poverty (%) (source: Socio Economic National Survey, March 2018)

3.2 Manual Annotation
Manual annotation aims to assign poor and non-poor labels to each tweet. The content of each tweet is inspected manually to provide training data for pseudo labeling. In total, 153 tweets have been labeled. Examples of tweets with poor and non-poor indicators are presented in Table 3. The annotation also determines whether tweets contain the pre-defined keywords, with value 1 if the keywords exist and value 0 otherwise.

Table 3: Example tweets with poor and non-poor indicator

class      tweets
poor       Jumlah uang yg beredar sama, namun penduduk meningkat, anak2 itu tidak dapat lapangan kerja, kemiskinan meningkat
poor       Dan anyway, karena pengontrolan yg kurang itu, negara dipuyengkan oleh beban generasi anak yg tidak terkontrol, efeknya,kemiskinan meningkat
poor       Awal bulan padahal, udah miskin ajayaa
poor       Negara miskin karena penduduknya sendiri. Yg miskin ngga berusaha maju yg kaya berlagak miskin punya mobil pribadi masih pake bbm subsidi
poor       Orang miskin jgn sakit biaya rumah sakit mahal. Orang miskin kerja lebih keras lg bayar uang kuliah anakmu dua kali lipat penghasilanmu.
non-poor   Ngedengerin ecen ngomong suaranya berat bgt kqya bawa beras sekarung -_-
non-poor   Kaya materi tapi miskin hati. Kasian
non-poor   makanan buat brino... bubur kacang merah + beras merah + daging ayam cincang.. udh kyk baby ni ci binyo
non-poor   Cendol terbuat dari tepung beras trus di kasih pandan gituu...??? Pantesss konyang wak dek nyoo... :)
non-poor   Minum beras kencur biar sixpack

3.3 Pseudo Labeling
We approach the task as a classification problem by applying Pseudo Labeling, a semi-supervised machine learning method, to the Twitter corpus described in Section 3.1. Figure 4 illustrates Pseudo Labeling. First, the model is trained on a small set of labeled data. Then, using the trained model, we predict labels on the unlabeled data, creating pseudo-labels. The model is re-trained on the combination of the labeled and newly pseudo-labeled data. These steps are repeated until there is no more unlabeled data left. It is expected that the model performance will improve as more pseudo-labeled data is added. In our experiment, we use the implementation of Pseudo Labeling from Scikit Learn [12].
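The loop described above can be sketched in plain Python as follows. This is not the Scikit Learn implementation used in the experiment: the nearest-centroid base classifier, the function names, and the rule of pseudo-labeling the most confident `batch_size` points in each round are all assumptions made to keep the illustration self-contained.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(X, y):
    """Hypothetical base model: one centroid per class label."""
    return {
        label: centroid([x for x, l in zip(X, y) if l == label])
        for label in set(y)
    }


def predict(model, x):
    """Return (label, margin); a larger margin means higher confidence."""
    ranked = sorted((squared_distance(x, c), label) for label, c in model.items())
    best_distance, best_label = ranked[0]
    margin = ranked[1][0] - best_distance if len(ranked) > 1 else 0.0
    return best_label, margin


def pseudo_label(X_labeled, y_labeled, X_unlabeled, batch_size=2):
    """Train, pseudo-label a batch, re-train; stop when no unlabeled data is left."""
    X_lab, y_lab, pool = list(X_labeled), list(y_labeled), list(X_unlabeled)
    model = train(X_lab, y_lab)
    while pool:
        scored = [(predict(model, x), x) for x in pool]
        # Assumption: the most confident predictions are pseudo-labeled first.
        scored.sort(key=lambda item: item[0][1], reverse=True)
        for (label, _margin), x in scored[:batch_size]:
            X_lab.append(x)
            y_lab.append(label)
        pool = [x for _prediction, x in scored[batch_size:]]
        model = train(X_lab, y_lab)  # re-train on labeled + pseudo-labeled data
    return model, X_lab, y_lab
```

With two labeled seeds such as `[[0.0], [10.0]]` labeled `non-poor` and `poor`, the unlabeled points nearest to each seed are absorbed first, and the centroids shift as each batch is added.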
Our model uses 17 numerical features, comprising three tweet metadata fields (PROV, KEC, GENDER) and the frequencies of 14 pre-defined keywords: beras, sembako, miskin, paket sembako, beras murah, harga beras, beras naik, harga naik, antri sembako, hidup susah, #berasnaik, #bbmnaik, #sembako, #berasmahal. These keywords were defined
level. Thus, it can validate our initial assumption that most Twitter
users are from middle-upper class society.

ACKNOWLEDGMENT
We would like to acknowledge Pulse Lab Jakarta for organising
Research Dive event and providing the data. We also wish to thank
Prof. Arief Anshory Yusuf and Prof. Dedi Rosadi for their insightful
feedbacks.

REFERENCES
[1] APJI APJII. 2016. Penetrasi dan Perilaku Pengguna Internet Indonesia. Infogra�s
Hasil Survey (2016), 1–35.
[2] APJI APJII. 2017. Penetrasi dan Perilaku Pengguna Internet Indonesia. Infogra�s
Hasil Survey (2017), 1–39.
[3] Eiji Aramaki, Sachiko Maskawa, and Mizuki Morita. 2011. Twitter Catches the
Flu: Detecting In�uenza Epidemics Using Twitter. In Proceedings of the Conference
on Empirical Methods in Natural Language Processing (EMNLP ’11). Association
for Computational Linguistics, Stroudsburg, PA, USA, 1568–1576. http://dl.acm.
org/citation.cfm?id=2145432.2145600
[4] Pablo D Azar and Andrew W Lo. 2016. Practical Applications of The Wisdom
of Twitter Crowds: Predicting Stock Market Reactions to FOMC Meetings via
Twitter Feeds. Practical Applications 4, 2 (2016), 1–4.
[5] Johan Bollen, Huina Mao, and Xiaojun Zeng. 2011. Twitter mood predicts the
stock market. Journal of computational science 2, 1 (2011), 1–8.
[6] John D. Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discrim-
inating Gender on Twitter. In Proceedings of the Conference on Empirical Methods
Figure 4: Pseudo labeling in Natural Language Processing (EMNLP ’11). Association for Computational
Linguistics, Stroudsburg, PA, USA, 1301–1309. http://dl.acm.org/citation.cfm?
id=2145432.2145568
based on Socio Economic National Survey (SUSENAS) in 2018 (see [7] Nugroho Dwi Prasetyo and Claudia Hau�. 2015. Twitter-based election prediction
details in Section 3.1).

4 RESULTS AND ANALYSIS

Table 4 presents the results of our experiment. Using only 10 labeled data points as initial seeds, our model obtained an accuracy of 65.73%, with precision and recall of 64% and 66% respectively. Our results demonstrate that the chosen features are effective for this task. Moreover, the results provide evidence that the pre-defined keywords are suitable for identifying tweets with poverty-related information. The model's performance can be improved by adding more pseudo-labeled data. However, in our experiment we did not remove fake tweets that are generated automatically by bot software. Thus, our results may not describe the real poverty condition in a particular area.

Table 4: Experiment results

precision   recall   F1-score   accuracy
64%         66%      65%        65.73%
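As a quick consistency check on Table 4, the F1-score can be recomputed from the reported precision and recall as their harmonic mean. The sketch below only uses the two rounded figures from the table; the function and variable names are illustrative and are not taken from the authors' code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values as reported in Table 4.
reported_precision = 0.64
reported_recall = 0.66

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.4f}")
```

Rounded to whole percentage points, the result agrees with the 65% F1-score in the table.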

5 CONCLUSIONS AND FUTURE WORKS


This paper describes experiments on identifying poverty indicators on Twitter. We find that tweets with poverty-related information can be classified using several pre-defined poverty-related keywords. Pseudo-labeling proved to be a suitable method for this task, since the available training data is limited. A possible extension of this work is to apply pre-processing steps to remove fake Twitter accounts, to ensure that only tweets from real accounts are included. In addition, it would be interesting to investigate the correlation between the authors' writing style and their poverty/income
Estimating Poverty at the Household Level with Social Media Data and Household Survey Results

Eka Puspitawati, Universitas Pertamina, Jakarta, Indonesia
Hizkia H. D. Tasik, Sam Ratulangi University, Manado, Indonesia
Nurlatifah, Statistics Indonesia, Bogor, Indonesia
M. Eko Fahillah, TNP2K, Jakarta, Indonesia
Rajius Idzalika, Pulse Lab Jakarta, Jakarta, Indonesia

ABSTRACT
Social media data, collected automatically through the interaction of individuals, can provide insights on many emerging issues ranging from social life to politics. This extended abstract explores how social media data correlate with poverty measurement at both regional and household levels, using community/village-level poverty mapping and a household poverty measurement survey. Measuring poverty through social media data offers a more rapid and inexpensive alternative to completing poverty mapping or household surveys. We describe the statistical techniques that allow us to evaluate the potential of poverty estimation using social media data, particularly Twitter. We also discuss follow-ups that can contribute to better estimations.

KEYWORDS
Poverty, Twitter, Household Survey
Figure 1: Twitter user penetration from 2012-2018, by country
1 INTRODUCTION
Household/individual poverty in Indonesia is usually measured through the household/individual expenditure level. This measurement requires a survey of a representative sample of the household population in Indonesia. Such a household survey is generally expensive and time-consuming, and the data it generates are constrained to a certain time frame.

To overcome that restriction, we investigate the connection between social media (Twitter) data and household survey data. The number of Twitter users has increased significantly over the past six years (2012-2018), especially in Asia-Pacific.

Since increased access to internet services boosts economic growth and improves the well-being of the poor [2], we are interested in seeing how data from Twitter users (with internet access) can explain poverty at both regional and household levels.

We use Twitter data for the Jabodetabek region in 2014, available as part of the Research Dive 7 [3] initiative by UN Pulse Lab. For the poverty measurement data, we use the Poverty Map of Indonesia 2015, representing poverty (headcount ratio and Gini ratio) at community/village level, available from the SMERU research institute, and the Indonesian Family Life Survey (IFLS) [1], available from RAND Corporation. The poverty map data is used to investigate the correlation between social media data and regional poverty level measurement, while the IFLS data is used to investigate the correlation between social media and household poverty level measurement.

2 TWITTER METADATA PERFORMANCE
Expanding on the available Twitter data, we find variation between districts in Jabodetabek in the total number of Twitter User IDs, the number of messages posted, the number of mentions in messages, and the total number of hashtags in messages. The districts observed in the Jabodetabek region are:

3101 Kepulauan Seribu
3171 Kota Jakarta Selatan
3172 Kota Jakarta Timur
3173 Kota Jakarta Pusat
3174 Kota Jakarta Barat
3175 Kota Jakarta Utara
3201 Kabupaten Bogor
3216 Kabupaten Bekasi
3271 Kota Bogor
3275 Kota Bekasi
3276 Kota Depok
3603 Kabupaten Tangerang
3671 Kota Tangerang
3674 Kota Tangerang Selatan

Figure 2: Total Twitter User IDs in Jabodetabek region 2014, at district level (bar chart "Total User id By District"; x-axis: district code, y-axis: sum of total_user_id)

Figure 3: Total Twitter messages posted in Jabodetabek region 2014, at district level (bar chart "Total Post By District"; x-axis: district code, y-axis: sum of total_post)

Figure 4: Total mentions in Twitter messages in Jabodetabek region 2014, at district level (bar chart "Total Mention By District"; x-axis: district code, y-axis: sum of mention)

Figure 5: Total hashtags in Twitter messages in Jabodetabek region 2014, at district level (bar chart "Total Hashtag By District"; x-axis: district code, y-axis: sum of hashtag)

3 REGIONAL LEVEL CORRELATION WITH POVERTY
The SMERU Poverty Map of Indonesia 2015 calculates various poverty measures based on several surveys conducted by Statistics Indonesia. We use two poverty measures to explore against social media data: the poverty headcount index and the Gini coefficient.

The poverty headcount index (P0) is defined as:

\[ P_0 = \frac{1}{N} \sum_{i=1}^{q} \left( \frac{z - y_i}{z} \right)^{0} \tag{1} \]

Where:
P_0 : Headcount index
z : Poverty line
y_i : Average monthly per-capita expenditure of household below the poverty line
q : Number of the population below the poverty line
N : Number of population

The poverty headcount index directly indicates the ratio of people below the poverty line in a region. A higher index indicates a higher level of poverty.

The Gini coefficient (G1) is defined as:

\[ G_1 = 1 - \frac{1}{N} \sum_{i=1}^{N} \left( Y_i + Y_{i-1} \right) \tag{2} \]

Where:
G_1 : Gini coefficient
N : Number of population
Y_i : Average monthly per-capita expenditure of household below the poverty line

The Gini coefficient indicates the equality of distribution: a Gini coefficient of zero (0) represents perfect equality, while a Gini coefficient of one (1) represents maximum inequality.
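The two regional measures can be illustrated with a small numerical sketch. It reads the headcount index as the share of units whose expenditure falls below the poverty line (q divided by N). For the Gini coefficient it assumes that Y_i in the trapezoid-style sum denotes the cumulative expenditure share of the i-th unit after sorting, with Y_0 = 0; this is the usual reading of that formula but is an assumption here, since the source describes the symbol differently. The expenditure figures and function names are made up for illustration and do not come from the SMERU data.

```python
from typing import Sequence


def headcount_index(expenditures: Sequence[float], poverty_line: float) -> float:
    """Share of units with expenditure below the poverty line (q / N)."""
    n = len(expenditures)
    if n == 0:
        raise ValueError("expenditures must not be empty")
    q = sum(1 for y in expenditures if y < poverty_line)
    return q / n


def gini_coefficient(expenditures: Sequence[float]) -> float:
    """Gini via 1 - (1/N) * sum(Y_i + Y_{i-1}).

    Assumption: Y_i is the cumulative expenditure share of the i-th unit
    after sorting in ascending order, with Y_0 = 0.
    """
    n = len(expenditures)
    total = sum(expenditures)
    if n == 0 or total <= 0:
        raise ValueError("need at least one unit and a positive total")
    cumulative = 0.0
    previous_share = 0.0
    trapezoid_sum = 0.0
    for y in sorted(expenditures):
        cumulative += y
        share = cumulative / total
        trapezoid_sum += share + previous_share
        previous_share = share
    return 1.0 - trapezoid_sum / n


# Hypothetical monthly per-capita expenditures for one village.
village = [100.0, 200.0, 300.0, 400.0]
print(headcount_index(village, poverty_line=250.0))
print(gini_coefficient(village))
```

With equal expenditures the function returns zero, and when one unit holds everything it returns 1 - 1/N, which approaches one as N grows.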
3.1 Correlation between Twitter's aggregated data and poverty headcount index
To explore the correlation between Twitter and the regional poverty headcount index at the community/village level, we need to aggregate the Twitter data to the community/village level and compare the results. We use Ordinary Least Squares (OLS) and observe the resulting significance of the aggregated Twitter indicators for estimating poverty headcount:

\[ y_i = \beta_0 + \beta_1 \ln(x_{i1}) + \beta_2 \ln(x_{i2}) + \beta_3 \ln(x_{i3}) + \beta_4 \ln(x_{i4}) + \beta_5 \ln(x_{i5}) + \varepsilon \tag{3} \]

Where:
y_i : headcount index
x_{i1} : total number of unique Twitter User IDs
x_{i2} : total number of Twitter messages posted
x_{i3} : total number of mentions (@) in Twitter messages
x_{i4} : total number of links (http://) in Twitter messages
x_{i5} : total number of unique locations (latitude/longitude) of Twitter posts
\beta : estimated regression coefficients
\varepsilon : errors

The results suggest that the headcount index has a stronger correlation with the number of unique Twitter User IDs and the number of unique Twitter post locations (mobility). The findings show that a higher number of Twitter users and higher mobility of users can indicate communities/villages with lower poverty headcount.

Variables            Y (Poor head)
ln_total_user_id     -0.11495**     (0.05778)
ln_total_post        -0.05048       (0.10393)
ln_mention            0.02122       (0.06475)
ln_hashtag           -0.00730       (0.03431)
ln_links             -0.09113       (0.05894)
ln_locations         -0.21993***    (0.06801)
Constant             -1.41331***    (0.12818)
Observations          1,008
R-squared             0.44891
*** p<0.01, ** p<0.05, * p<0.1

Table 1. Estimation results for correlation between Twitter data and poverty headcount index

3.2 Correlation between Twitter's aggregated data and Gini ratio quintile
To explore the correlation between Twitter and the regional Gini ratio quintile at community/village level, we apply the same aggregation process to prepare comparable Twitter datasets. We use the Ordered Logit method and observe the resulting significance of the aggregated Twitter indicators for estimating the Gini quintile:

\begin{align}
\Pr(Y = 1 \mid X_1, X_2, X_3, X_4, X_5, X_6) \tag{4} \\
= \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6)}} \tag{5}
\end{align}

Where:
Y : Quintile Gini Ratio (percent)
X_1 : Total User IDs of Twitter (number)
X_2 : Total posts in Twitter (number)
X_3 : Average mentions in Twitter per user (number)
X_4 : Average hashtags in Twitter per user (number)
X_5 : Average links in Twitter per user (number)
X_6 : Average locations in Twitter per user (number)
\beta_1, ..., \beta_6 : Ordered log-odds (logit) regression coefficients
e : exponential

The results suggest that the Gini ratio quintile has a stronger correlation with the number of unique Twitter User IDs and the average number of links posted (sharing of internet content resources). This outcome shows that a higher number of Twitter users and a higher level of resource sharing can indicate communities/villages with a lower Gini quintile.

Variables               Y (Gini Ratio Quintile)
total_user_id           -0.00199***    (0.00036)
total_post              -0.00005       (0.00010)
avg_mention_per_user    -0.01490       (0.01311)
avg_hashtag_per_user    -0.01452       (0.03309)
avg_links_per_user       0.04284*      (0.02405)
avg_location_per_user   -0.04441       (0.03525)
Constant cut1           -0.90845***    (0.09990)
Constant cut2           -0.17323*      (0.09498)
Constant cut3           -1.12987***    (0.10124)
Constant cut4            2.35951***    (0.12148)
Constant                -1.41331***    (0.12818)
Observations             1,199
Pseudo R2                0.0729
*** p<0.01, ** p<0.05, * p<0.1

Table 2. Estimation results for the correlation between Twitter data and Gini ratio quintile
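To make the ordered logit setup more concrete, the following sketch shows how a linear predictor and a set of cut points translate into probabilities for the five quintile categories. It uses the common parameterisation in which the cumulative probability of falling at or below category j is the logistic function of (cut point j minus the linear predictor); whether the authors' software used exactly this form is an assumption. All coefficients, cut points and indicator values below are hypothetical and are not the estimates from Table 2.

```python
import math
from typing import List, Sequence


def logistic(value: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-value))."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    exp_value = math.exp(value)
    return exp_value / (1.0 + exp_value)


def ordered_logit_probabilities(
    linear_predictor: float, cut_points: Sequence[float]
) -> List[float]:
    """Category probabilities from P(Y <= j) = logistic(cut_j - x'beta)."""
    if list(cut_points) != sorted(cut_points):
        raise ValueError("cut points must be in increasing order")
    cumulative = [logistic(cut - linear_predictor) for cut in cut_points]
    cumulative.append(1.0)  # the last category closes the distribution
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical coefficients and indicator values for one village.
coefficients = [-0.002, 0.04]   # e.g. total users, average links per user
indicators = [150.0, 2.5]
cut_points = [-1.0, 0.0, 1.0, 2.0]  # four cut points give five quintiles

x_beta = sum(b * x for b, x in zip(coefficients, indicators))
quintile_probabilities = ordered_logit_probabilities(x_beta, cut_points)
print([round(p, 3) for p in quintile_probabilities])
```

A negative coefficient lowers the linear predictor and therefore shifts probability mass toward the lower quintiles, which is how a sign in such a table is usually interpreted.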
4 HOUSEHOLD LEVEL CORRELATION WITH POVERTY

4.1 Data Preparation
Twitter does not have information on users' income or expenditure. To provide such information, we borrow household expenditure information from the IFLS. The assumption involved is that observable attributes in both data sources are the matching factors, so that a Twitter user with specific matching factors is assumed to be identical to IFLS individuals with similar characteristics. These individuals are then connected to their household ID to obtain information on household expenditure.

The first step of this analysis is to find the common factors on both sides. We selected November 2014 as the time frame for the Twitter data, to match it with IFLS survey wave 5, which only started in October 2014, and to avoid anomalies during the year-end. We also restricted the age of the Twitter users and IFLS individuals to between 16 and 60 years old. Further, the common factors identified are gender (F/M), province, district and sub-district code of home location, and the type of device used (notebook/cellphone). The matching factors are limited, and we hope that upcoming surveys will include more internet-related information so that this proof of concept becomes more reliable.

From the Twitter extraction, we obtain 128,761 users to match with 3,204 IFLS individuals and their respective households.

Table 3 (p-values in parentheses)

Variables         Y (Expenditure)
hp                 3.425e-02***    (<2e-16)
male               3.679e-04       (0.1729)
tweets             1552e-05        (0.1752)
mentions           7.793e-06       (0.3196)
hashtags          -2.200e-05       (0.0837)
links             -3.024e-05       (0.0516)
location_count    -2.510e-05       (0.1006)
min_lat           -5.122e-01***    (<2e-16)
max_lat            1.216e-01***    (<2e-16)
min_lon           -6.750e-02***    (<2e-16)
max_lon            1.404e-02***    (1.68e-09)
distance_box       1.438e-02       (0.2667)
Adjusted R-sq      0.5287
p-value            <2.2e-16
*** p<0.01, ** p<0.05, * p<0.1
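The matching idea from the data preparation step can be sketched as a lookup keyed on the shared attributes. Note the deliberate simplification: this example assigns each Twitter user the mean household expenditure of survey individuals with the same key, whereas the study itself imputes expenditure with a linear regression. The field names (`gender`, `district_code`, `device`, `household_expenditure`) and all records are hypothetical and only serve to show the mechanics.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional, Tuple

MatchKey = Tuple[str, str, str]  # (gender, district code, device type)


def match_key(record: dict) -> MatchKey:
    """Build the key from attributes observable in both data sources."""
    return (record["gender"], record["district_code"], record["device"])


def build_expenditure_lookup(survey_rows: List[dict]) -> Dict[MatchKey, float]:
    """Mean household expenditure per matching cell (simplified imputation)."""
    grouped: Dict[MatchKey, List[float]] = defaultdict(list)
    for row in survey_rows:
        grouped[match_key(row)].append(row["household_expenditure"])
    return {key: mean(values) for key, values in grouped.items()}


def impute_expenditure(
    twitter_user: dict, lookup: Dict[MatchKey, float]
) -> Optional[float]:
    """Return the matched cell mean, or None when no survey match exists."""
    return lookup.get(match_key(twitter_user))


# Hypothetical survey individuals and Twitter users.
survey = [
    {"gender": "F", "district_code": "3171", "device": "cellphone",
     "household_expenditure": 2.0},
    {"gender": "F", "district_code": "3171", "device": "cellphone",
     "household_expenditure": 4.0},
    {"gender": "M", "district_code": "3276", "device": "notebook",
     "household_expenditure": 3.0},
]
lookup = build_expenditure_lookup(survey)

matched_user = {"gender": "F", "district_code": "3171", "device": "cellphone"}
unmatched_user = {"gender": "M", "district_code": "3101", "device": "cellphone"}
print(impute_expenditure(matched_user, lookup))
print(impute_expenditure(unmatched_user, lookup))
```

Returning `None` for unmatched users makes the limited set of matching factors visible: any Twitter user outside the observed survey cells receives no imputed value.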

4.2 Matching based imputation and significant proxy from Twitter indicators
The second step is to impute Twitter user expenditure based on similarity with IFLS individuals. Among the imputation methods, we simply use the univariate method of linear regression to estimate expenditure, because our expenditure variable is continuous. Imputation is originally a method to fill in missing values in a credible and scientific way. According to Rubin [4], inferences from multiple imputation, when done properly, are statistically valid. However, for the purposes of this experiment we only produce one imputed data set as a showcase.

After obtaining the imputed expenditure for Twitter users, we regress expenditure by OLS on several common variables as well as metadata attributes such as type of device, gender, number of tweets, number of mentions, number of hashtags, number of links, number of unique locations for tweet posts, the maximum and minimum latitude and longitude of tweet posts, and the width of mobility. The latter variable is derived from the maximum and minimum latitude and longitude.

A preliminary insight provided in Table 3 suggests that users who access Twitter using a cellphone have a significantly higher expenditure than users who access Twitter via a notebook. From the spatial perspective, lower latitude and lower longitude of tweet posts in the Jabodetabek area are negatively correlated with expenditure.

REFERENCES
[1] RAND Corporation. 2015. The Indonesian Family Life Survey. (2015). https://www.rand.org/labor/FLS/IFLS.html
[2] Hernan Galperin and M. Fernanda Viecens. [n. d.]. Connected for Development? Theory and evidence about the impact of Internet technologies on poverty alleviation. Development Policy Review 35, 3 ([n. d.]), 315–336. https://doi.org/10.1111/dpr.12210
[3] Pulse Lab Jakarta. 2018. Artificial Intelligence and Machine Learning for Estimating Poverty. (2018). http://rd.pulselabjakarta.id/research_dive/getdetail/21

http://rd.pulselabjakarta.id/

Pulse Lab Jakarta is grateful for the generous support from the Government of Australia

The covers of this report are printed on recycled paper