R-bloggers
Slides from my talks about Demystifying Big Data and Deep Learning (and how to get started)
Generating data to explore the myriad causal effects that can be estimated in observational data analysis
Checklist Recipe – How we created a template to standardize species data
Zero Counts in dplyr
Cognitive Services in Containers
Slides from my talks about Demystifying Big Data and Deep Learning
(and how to get started)
Posted: 19 Nov 2018 04:00 PM PST
(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)
On November 7th, Uwe Friedrichsen and I again gave our talk from the JAX conference 2018, Deep
Learning – a Primer, at the W-JAX in Munich.
A few weeks before, I gave a similar talk at two events about Demystifying Big Data and Deep
Learning (and how to get started).
Here are the two very similar presentations from these talks:
To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science,
Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave,
LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and
more...
(This article was first published on ouR data generation, and kindly contributed to R-bloggers)
I’ve been inspired by two recent talks describing the challenges of using instrumental variable (IV)
methods. IV methods are used to estimate the causal effects of an exposure or intervention when there
is unmeasured confounding. This estimated causal effect is very specific: the complier average causal
effect (CACE). But, the CACE is just one of several possible causal estimands that we might be
interested in. For example, there’s the average causal effect (ACE) that represents a population
average (not just based on the subset of compliers). Or there’s the average causal effect for the exposed
or treated (ACT) that allows for the fact that the exposed could be different from the unexposed.
I thought it would be illuminating to analyze a single data set using different causal inference methods,
including IV as well as propensity score matching and inverse probability weighting. Each of these
methods targets different causal estimands, which may or may not be equivalent depending on the
subgroup-level causal effects and underlying population distribution of those subgroups.
This is the first of a two-part post. In this first part, I am focusing entirely on the data generation process
(DGP). In the follow-up, I will get to the model estimation.
Since the motivation here is instrumental variable analysis, it seems natural that the data generation
process include a possible instrument. (Once again, I am going to refer to elsewhere in case you want
more details on the theory and estimation of IV models. Here is an excellent in-depth tutorial by
Baiocchi et al. that provides great background. I’ve even touched on the topic of CACE in an earlier
series of posts. Certainly, there is no lack of discussion on this topic, as a quick search around the
internet will make readily obvious.)
The figure below is a variation on the directed acyclic graph (DAG) that is often very useful in laying out
causal assumptions of a DGP. This particular figure is a type of SWIG: single world intervention graph.
SWIGs, developed by Robins and Richardson, fuse the worlds of potential outcomes and DAGs.
1. There is an instrumental variable \(A\) that has a direct causal relationship only to the
exposure of interest, \(T\). If the exposure is a particular medical intervention, think of the
instrument as some kind of encouragement to get that treatment. Some people get the
encouragement, others don’t – though on average folks who are encouraged are no different
from folks who are not (at least not in ways that relate to the outcome).
2. There is a confounder \(U\), possibly unmeasured, that is related both to potential outcomes
and the exposure, but not to the encouragement (the instrument)! In the example below, we
conceive of \(U\) as an underlying health status.
3. Exposure variable \(T\) (that, in this case, is binary, just to keep things simpler) indicates
whether a person gets the treatment or not.
4. Each individual will have two potential treatments \(T^0\) and \(T^1\), where \(T^0\) is the
treatment when there is no encouragement (i.e., \(A = 0\)), and \(T^1\) is the treatment when \(A =
1\). For any individual, we actually only observe one of these treatments (depending on the
actual value of \(A\)). The population of interest consists of always-takers, compliers, and
never-takers. Never-takers always reject the treatment regardless of whether or not they get
encouragement – that is, \(T^0 = T^1 = 0\). Compliers only seek out the treatment when they
are encouraged, otherwise they don’t: \(T^0 = 0\) and \(T^1 = 1\). And always-takers always
(of course) seek out the treatment: \(T^0 = T^1 = 1\). (In order for the model to be identifiable,
we need to make a not-so-crazy assumption that there are no so-called defiers, where \(T^0 =
1\) and \(T^1 = 0\).) An individual may have a different complier status depending on the
instrument and exposure (i.e., one person might be a never-taker in one scenario but a
complier in another). In this simulation, larger values of the confounder \(U\) will increase
\(P(T^a = 1)\) for both \(a \in \{0,1\}\).
5. Each individual will have two potential outcomes, only one of which is observed. \(Y_i^0\) is
the outcome for person \(i\) when they are unexposed or do not receive the treatment.
\(Y_i^1\) is the outcome for that same person when they are exposed or do receive the
treatment. In this case, the confounder \(U\) can affect the potential outcomes. (This diagram
is technically a SWIT, which is a template, since I have generically referred to the potential
treatment \(T^a\) and potential outcome \(Y^t\).)
6. Not shown in this diagram are the observed \(T_i\) and \(Y_i\); we assume that \(T_i = (T_i^a |
A = a)\) and \(Y_i = (Y_i^t | T = t)\).
7. Also not shown on the graph is the causal estimand of an exposure for individual \(i\), which
can be defined as \(CE_i \equiv Y^1_i - Y^0_i\). We can calculate the average causal effect,
\(E[CE]\), for the sample as a whole as well as for subgroups.
The workhorse of this data generating process is a logistic sigmoid function that represents the mean
potential outcome \(Y^t\) at each value of \(u\). This allows us to easily generate homogeneous or
heterogeneous causal effects. The function has four parameters, \(M\), \(\gamma\), \(\delta\), and \
(\alpha\):
\[
Y^t = f(u) = \frac{M}{1 + \exp(-\gamma(u - \delta))} + \alpha,
\]
where \(M\) is the maximum of the function (assuming the minimum is \(0\)), \(\gamma\) is the
steepness of the curve, \(\delta\) is the inflection point of the curve, and \(\alpha\) is a vertical shift of
the entire curve. This function is easily implemented in R:
fYt <- function(u, M, gamma, delta, alpha) M / (1 + exp(-gamma * (u - delta))) + alpha
The figures below show the mean of the potential outcomes \(Y^0\) and \(Y^1\) under
two different scenarios. On the left, the causal effect at each level of \(u\) is
constant, and on the right, the causal effect changes over the different values of
\(u\), increasing rapidly when \(0 < u < 0.2\).
Here’s a closer look at the different causal effects under the first scenario
of homogeneous causal effects across values of \(u\) by generating some data. The
data definitions are provided in three steps. In the first step, the confounder
\(U\) is generated. Think of this as health status, which can take on values ranging
from \(-0.5\) to \(0.5\), where lower scores indicate worse health.
Next up are the definitions of the potential outcomes of treatment and outcome,
both of which are dependent on the unmeasured confounder.
library(simstudy)
def
Once all the definitions are set, it is quite simple to generate the data:
set.seed(383726)
dx
dx
## id S A U T0 T1x T1 Y0 Y1 T Y Y.r fS
## 1: 1 1 1 0.282 0 0 0 5.5636 6.01 0 5.5636 6.01 Never
## 2: 2 3 0 0.405 1 1 1 4.5534 5.83 1 5.8301 4.55 Always
## 3: 3 3 1 0.487 1 1 1 6.0098 5.82 1 5.8196 5.82 Always
## 4: 4 2 1 0.498 0 1 1 5.2695 6.43 1 6.4276 6.43 Complier
## 5: 5 1 1 -0.486 0 0 0 0.0088 1.02 0 0.0088 1.02 Never
## ---
## 496: 496 3 0 -0.180 1 0 1 0.8384 1.10 1 1.0966 0.84 Always
## 497: 497 3 1 0.154 1 1 1 4.9118 5.46 1 5.4585 5.46 Always
## 498: 498 1 1 0.333 0 0 0 5.4800 5.46 0 5.4800 5.46 Never
## 499: 499 1 1 0.049 0 0 0 3.4075 5.15 0 3.4075 5.15 Never
## 500: 500 1 1 -0.159 0 0 0 0.4278 0.96 0 0.4278 0.96 Never
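The simstudy definition code above was truncated in this excerpt. As a rough, self-contained base-R sketch of a comparable data generation process (the link functions and parameter values below are assumptions for illustration, not the author's actual definitions):

```r
set.seed(383726)
n <- 500

# Step 1: the confounder U ("health status"), ranging from -0.5 to 0.5
U <- runif(n, -0.5, 0.5)

# Step 2: potential treatments; higher U raises P(T^a = 1) for both a = 0 and a = 1,
# and T1 >= T0 rules out defiers (the link here is an assumption)
p  <- plogis(4 * U)
T0 <- rbinom(n, 1, 0.3 * p)
T1 <- ifelse(T0 == 1, 1, rbinom(n, 1, p))

# Step 3: potential outcomes from the logistic sigmoid; shifting alpha by 1
# yields a homogeneous causal effect of 1 (parameters assumed)
fYt <- function(u, M, gamma, delta, alpha) M / (1 + exp(-gamma * (u - delta))) + alpha
Y0 <- fYt(U, M = 6, gamma = 15, delta = 0, alpha = 0) + rnorm(n, 0, 0.5)
Y1 <- fYt(U, M = 6, gamma = 15, delta = 0, alpha = 1) + rnorm(n, 0, 0.5)

# Complier status, recoverable here only because both potential treatments are known
fS <- ifelse(T0 == 1, "Always", ifelse(T1 == 1, "Complier", "Never"))
dd <- data.frame(U, T0, T1, Y0, Y1, fS)
```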
The various average causal effects, starting with the (marginal) average causal
effect and ending with the average causal effect for those treated are all close to
\(1\):
ACE
## ceType ce
## 1: ACE 0.97
## 2: AACE 0.96
## 3: CACE 1.00
## 4: NACE 0.96
## 5: ACT 1.05
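Because the simulation keeps both potential outcomes (and the complier status fS) for every individual, each of these estimands is just a subgroup mean of \(Y^1 - Y^0\). A self-contained sketch with made-up numbers (illustrative, not the author's data):

```r
# Toy data: both potential outcomes plus complier status and observed treatment
d <- data.frame(
  Y0 = c(5.0, 4.5, 3.0, 2.0, 1.0),
  Y1 = c(6.0, 5.5, 4.1, 2.9, 2.0),
  T  = c(0, 1, 1, 1, 0),
  fS = c("Never", "Always", "Complier", "Always", "Never")
)

ce   <- d$Y1 - d$Y0                  # individual causal effects
ACE  <- mean(ce)                     # marginal average causal effect
AACE <- mean(ce[d$fS == "Always"])   # always-takers
CACE <- mean(ce[d$fS == "Complier"]) # compliers
NACE <- mean(ce[d$fS == "Never"])    # never-takers
ACT  <- mean(ce[d$T == 1])           # average causal effect for the treated
```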
Here is a visual summary of the generated data. The upper left shows the underlying
data generating functions for the potential outcomes and the upper right plot shows
the various average causal effects: average causal effect for the population (ACE),
average causal effect for always-takers (AACE), complier average causal effect
(CACE), average causal effect for never-takers (NACE), and the average causal
effect for the treated (ACT).
The true individual-specific causal effects color-coded based on complier status
(that we could never observe in the real world, but we can here in simulation
world) are on the bottom left, and the true individual causal effects for those who
received treatment are on the bottom right. These figures are only remarkable in
that all average causal effects and individual causal effects are close to \(1\),
reflecting the homogeneous causal effect data generating process.
Here is a set of figures for a heterogeneous data generating process (which can be
seen on the upper left). Now, the average causal effects are quite different from
each other. In particular, \(ACE < CACE < ACT\). Obviously, none of these quantities
is wrong; they are just estimating the average effect for different groups of
people that are characterized by different levels of health status \(U\):
Finally, here is one more scenario, also with heterogeneous causal effects. In this
case \(ACE \approx CACE\), but the other effects are quite different, actually with
different signs.
In the second part of this post, I will use this DGP and estimate these effects
using various modeling techniques. It will hopefully become apparent that different
modeling approaches provide estimates of different causal estimands.
To leave a comment for the author, please follow the link and comment on their blog: ouR
data generation.
(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)
Imagine you are a fish ecologist who compiled a list of fish species for your country. 🐟
Your list could be useful to others, so you publish it as a supplementary file to an article or in a
research repository. That is fantastic, but it might be difficult for others to discover your list or combine it
with other lists of species. Luckily there’s a better way to publish species lists: as a standardized
checklist that can be harvested and processed by the Global Biodiversity Information Facility (GBIF).
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools
for open science.
(This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)
Here’s a feature of dplyr that occasionally bites me (most recently while making these graphs). It’s
about to change mostly for the better, but is also likely to bite me again in the future. If you want to
follow along there’s a GitHub repo with the necessary code and data.
Say we have a data frame or tibble and we want to get a frequency table or set of counts out of it. In
this case, each row of our data is a person serving a congressional term for the very first time, for the
years 2013 to 2019. We have information on the term year, the party of the representative, and
whether they are a man or a woman.
library(tidyverse)
## Group labels
mf_labs <- tibble(M = "Men", F = "Women")
theme_set(theme_minimal())
df
#> > df
#> # A tibble: 280 x 4
#> pid start_year party sex
#>
#> 1 3160 2013-01-03 Republican M
#> 2 3161 2013-01-03 Democrat F
#> 3 3162 2013-01-03 Democrat M
#> 4 3163 2013-01-03 Republican M
#> 5 3164 2013-01-03 Democrat M
#> 6 3165 2013-01-03 Republican M
#> 7 3166 2013-01-03 Republican M
#> 8 3167 2013-01-03 Democrat F
#> 9 3168 2013-01-03 Republican M
#> 10 3169 2013-01-03 Democrat M
#> # ... with 270 more rows
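One object used by the plotting code below, sex_colors, is never defined in this excerpt. A minimal stand-in (the specific colors are an assumption, not the author's palette):

```r
# Hypothetical two-color palette for the F/M groups; any two colors will do
sex_colors <- c("#E69F00", "#0072B2")
```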
When we load our data into R with read_csv, the columns for party and sex are parsed as character
vectors. If you’ve been around R for any length of time, and especially if you’ve worked in the tidyverse
framework, you’ll be familiar with the drumbeat of “stringsAsFactors=FALSE”, by which we avoid
classing character variables as factors unless we have a good reason to do so (there are several good
reasons), and we don’t do so by default. Thus our df tibble shows us <chr> instead of <fct> for party and sex.
Now, let’s say we want a count of the number of men and women elected by party in each year.
(Congressional elections happen every two years.) We write a little pipeline to group the data by year,
party, and sex, count up the numbers, and calculate a frequency that’s the proportion of men and
women elected that year within each party. That is, the frequencies of M and F will sum to 1 for each
party in each year.
df %>%
group_by(start_year, party, sex) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N))
#> # A tibble: 14 x 5
#> # Groups: start_year, party [8]
#> start_year party sex N freq
#>
#> 1 2013-01-03 Democrat F 21 0.362
#> 2 2013-01-03 Democrat M 37 0.638
#> 3 2013-01-03 Republican F 8 0.101
#> 4 2013-01-03 Republican M 71 0.899
#> 5 2015-01-03 Democrat M 1 1
#> 6 2015-01-03 Republican M 5 1
#> 7 2017-01-03 Democrat F 6 0.24
#> 8 2017-01-03 Democrat M 19 0.76
#> 9 2017-01-03 Republican F 2 0.0667
#> 10 2017-01-03 Republican M 28 0.933
#> 11 2019-01-03 Democrat F 33 0.647
#> 12 2019-01-03 Democrat M 18 0.353
#> 13 2019-01-03 Republican F 1 0.0323
#> 14 2019-01-03 Republican M 30 0.968
You can see that, in 2015, neither party had a woman elected to Congress for the first time. Thus, the
freq is 1 in row 5 and row 6. But you can also see that, because there are no observed Fs in 2015,
they don’t show up in the table at all. The zero values are dropped. These rows, call them 5' and 6',
don’t appear:
df %>%
group_by(start_year, party, sex) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N)) %>%
ggplot(aes(x = start_year,
y = freq,
fill = sex)) +
geom_col() +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(values = sex_colors, labels = c("Women", "Men")) +
labs(x = "Year", y = "Percent", fill = "Group") +
facet_wrap(~ party)
ggsave("figures/df_chr_col.png")
That looks fine. You can see in each panel the 2015 column is 100% Men. If we were working on this a
bit longer we’d polish up the x-axis so that the dates were centered under the columns. But as an
exploratory plot it’s fine.
But let’s say that, instead of a column plot, you looked at a line plot. This would be a natural
thing to do given that time is on the x-axis, so you’re looking at a trend, albeit one over a small
number of years.
df %>%
group_by(start_year, party, sex) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N)) %>%
ggplot(aes(x = start_year,
y = freq,
color = sex)) +
geom_line(size = 1.1) +
scale_y_continuous(labels = scales::percent) +
scale_color_manual(values = sex_colors, labels = c("Women", "Men")) +
guides(color = guide_legend(reverse = TRUE)) +
labs(x = "Year", y = "Percent", color = "Group") +
facet_wrap(~ party)
ggsave("figures/df_chr_line.png")
A line graph based on character-encoded variables for party and sex. The trend line for
Women joins up the observed (or rather, the included) values, which don’t include the zero
values for 2015.
That’s not right. The line segments join up the data points in the summary tibble, but because those
don’t include the zero-count rows in the case of women, the lines join the 2013 and 2017 values
directly. So we miss that the count (and thus the frequency) went to zero in that year.
This issue has been recognized in dplyr for some time. It happened whether your data was encoded as
character or as a factor. There’s a long-running thread about it in the dplyr repository on GitHub, going
back to 2014. In the upcoming version 0.8 release of dplyr, the behavior for zero-count rows will
change, but as far as I can make out it will change for factors only. Let’s see what happens when we
change the encoding of our data frame. We’ll make a new one, called df_f.
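The code that builds df_f is not shown in this excerpt. One plausible construction (an assumption, not necessarily the author's code) simply re-encodes the character columns of df as factors:

```r
library(dplyr)

# Stand-in for the post's df, with character-encoded party and sex columns
df <- tibble(
  start_year = as.Date(c("2013-01-03", "2015-01-03")),
  party      = c("Democrat", "Republican"),
  sex        = c("F", "M")
)

# Convert every character column to a factor
df_f <- df %>% mutate_if(is.character, as.factor)
```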
df_f %>%
group_by(start_year, party, sex) %>%
tally()
#> # A tibble: 16 x 4
#> # Groups: start_year, party [8]
#> start_year party sex n
#>
#> 1 2013-01-03 Democrat F 21
#> 2 2013-01-03 Democrat M 37
#> 3 2013-01-03 Republican F 8
#> 4 2013-01-03 Republican M 71
#> 5 2015-01-03 Democrat F 0
#> 6 2015-01-03 Democrat M 1
#> 7 2015-01-03 Republican F 0
#> 8 2015-01-03 Republican M 5
#> 9 2017-01-03 Democrat F 6
#> 10 2017-01-03 Democrat M 19
#> 11 2017-01-03 Republican F 2
#> 12 2017-01-03 Republican M 28
#> 13 2019-01-03 Democrat F 33
#> 14 2019-01-03 Democrat M 18
#> 15 2019-01-03 Republican F 1
#> 16 2019-01-03 Republican M 30
Now we have party and sex encoded as unordered factors. This time, our zero rows are present (here
as rows 5 and 7). The grouping and summarizing operation has preserved all the factor values by
default, instead of dropping the ones with no observed values in any particular year. Let’s run our line
graph code again:
df_f %>%
group_by(start_year, party, sex) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N)) %>%
ggplot(aes(x = start_year,
y = freq,
color = sex)) +
geom_line(size = 1.1) +
scale_y_continuous(labels = scales::percent) +
scale_color_manual(values = sex_colors, labels = c("Women", "Men")) +
guides(color = guide_legend(reverse = TRUE)) +
labs(x = "Year", y = "Percent", color = "Group") +
facet_wrap(~ party)
ggsave("figures/df_fac_line.png")
A line graph based on factor-encoded variables for party and sex. Now the trend line for
Women does include the zero values, as they are preserved in the summary.
Now the trend line goes to zero, as it should. (And by the same token the trend line for Men goes to
100%.)
What if we want to keep working with our variables encoded as characters rather than factors? There is
a workaround, using the complete() function. You will need to ungroup() the data after summarizing
it, and then use complete() to fill in the implicit missing values. You have to re-specify the grouping
structure for complete, and then tell it what you want the fill-in value to be for your summary variables.
In this case it’s zero.
df %>%
group_by(start_year, party, sex) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N)) %>%
ungroup() %>%
complete(start_year, party, sex,
fill = list(N = 0, freq = 0))
#> # A tibble: 16 x 5
#> start_year party sex N freq
#>
#> 1 2013-01-03 Democrat F 21 0.362
#> 2 2013-01-03 Democrat M 37 0.638
#> 3 2013-01-03 Republican F 8 0.101
#> 4 2013-01-03 Republican M 71 0.899
#> 5 2015-01-03 Democrat F 0 0
#> 6 2015-01-03 Democrat M 1 1
#> 7 2015-01-03 Republican F 0 0
#> 8 2015-01-03 Republican M 5 1
#> 9 2017-01-03 Democrat F 6 0.24
#> 10 2017-01-03 Democrat M 19 0.76
#> 11 2017-01-03 Republican F 2 0.0667
#> 12 2017-01-03 Republican M 28 0.933
#> 13 2019-01-03 Democrat F 33 0.647
#> 14 2019-01-03 Democrat M 18 0.353
#> 15 2019-01-03 Republican F 1 0.0323
#> 16 2019-01-03 Republican M 30 0.968
If we re-draw the line plot with the ungroup() ... complete() step included, we’ll get the correct
output in our line plot, just as in the factor case.
df %>%
group_by(start_year, party, sex) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N)) %>%
ungroup() %>%
complete(start_year, party, sex,
fill = list(N = 0, freq = 0)) %>%
ggplot(aes(x = start_year,
y = freq,
color = sex)) +
geom_line(size = 1.1) +
scale_y_continuous(labels = scales::percent) +
scale_color_manual(values = sex_colors, labels = c("Women", "Men")) +
guides(color = guide_legend(reverse = TRUE)) +
labs(x = "Year", y = "Percent", color = "Group") +
facet_wrap(~ party)
ggsave("figures/df_chr_line_2.png")
The new zero-preserving behavior of group_by() for factors will show up in the upcoming version 0.8
of dplyr. It’s already there in the development version if you like to live dangerously. In the meantime, if
you want your frequency tables to include zero counts, then make sure you ungroup() and then
complete() the summary tables.
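For reference, the zero-preserving behavior in dplyr 0.8 is exposed through the .drop argument of group_by(); with .drop = FALSE, empty factor groups are kept. A small sketch with stand-in data (not the post's dataset):

```r
library(dplyr)

# Stand-in data: no "F" observation in 2015
d <- tibble(
  year = c(2013, 2013, 2015),
  sex  = factor(c("F", "M", "M"), levels = c("F", "M"))
)

# .drop = FALSE retains the empty (2015, F) group with n = 0
d %>% group_by(year, sex, .drop = FALSE) %>% tally()
```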
To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
I've posted several examples here of using Azure Cognitive Services for data science applications.
You can upload an image or video to the service and extract information about faces and
emotions, generate a caption describing a scene from a provided photo, or speak written text in a
natural voice. (If you haven't tried the Cognitive Services tools yet, you can try them out using the
instructions in this notebook using only a browser.)
But what if you can't upload an image or text to the cloud? Sending data outside your network might be
subject to regulatory or privacy policies. And if you could analyze the images or text locally, your
application could benefit from reduced latency and bandwidth.
Now, several of the Azure Cognitive Services APIs are available as Docker containers: you can
download a container that provides the exact same APIs as the cloud-based services, and run it on a
local Linux-based server or edge device. Images and text are processed directly in the container and
never sent to the cloud. A connection to Azure is required only for billing, which is at the same rate as
the cloud-based services (including a free tier).
Face: face detection, identity verification, and emotion detection. In private preview: sign up here.
Free and paid tier details here.
Recognize text: detect and extract printed text from images. In private preview: sign up here.
Free and paid tier details here.
Text Analytics: Key phrase extraction, language detection, and sentiment analysis. In public
preview, available to all Azure subscribers. Free and paid tier details here.
You can learn more about Cognitive Services in the video below. (The information about the new
container support starts at 11:20.)
You can also find detailed information about the Cognitive Services in containers in the blog post linked
below.
Microsoft Azure blog: Getting started with Azure Cognitive Services in containers
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.