Kaggle's State of Data Science and Machine Learning 2019: Enterprise Executive Summary
Kaggle’s third annual survey of its community shows the worldwide reach
of data science. Based on responses from 19,717 Kaggle members, this
report focuses on the 21% that are currently employed as data scientists.
Overall, we see a relatively young, highly educated community working at
companies of all sizes that are still figuring out the best way to adopt
machine learning technologies.

The Kaggle community includes learners and practitioners of all levels.
This analysis focuses on the professional data scientists within the
community: their education, employment, and the tools used to perform
their work. You’ll see certain regions, most notably the United States
and India, represented at the extremes of the data.

Data scientists have adopted cloud computing in their work, though not as
a replacement for local developer environments. Nevertheless, many have
significant budgets for cloud tools, with the United States spending
beyond others. Google Cloud Platform usage grew compared to the 2018
survey, with overall usage second to AWS. Among cloud machine learning
tools, use of Google Cloud AutoML nearly doubled since last year.

While many data scientists have advanced degrees, most continue to learn
new data science skills. Blogs, Kaggle forums, Coursera, and YouTube are
among the common methods of ongoing education. With many companies still
new to machine learning, it’s clear there is still a need for both
instruction and practical application of techniques.
Report Methodology
The content of this report focuses on respondents who are currently
employed and chose their current job title as “data scientist”. There are
many other job titles that support data science and machine learning
workflows and you can find their responses in the complete 2019 survey
dataset on Kaggle.

Many survey questions were multiple choice with the ability for
respondents to select all options that applied to them. For that reason,
you may see visualizations where the total percentage is more than 100%.
All monetary amounts captured in the report are in USD.
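To make the select-all-that-apply point concrete, here is a minimal Python sketch of how per-option percentages can sum to more than 100%. The responses, option names, and the `option_shares` helper are hypothetical and for illustration only; they are not taken from the survey dataset or from Kaggle's own analysis.

```python
from collections import Counter

# Hypothetical answers to one "select all that apply" question.
# Each inner list holds the options a single respondent ticked.
responses = [
    ["Blogs", "YouTube"],
    ["Blogs", "Kaggle forums", "Coursera"],
    ["YouTube"],
    ["Blogs", "Coursera"],
]


def option_shares(responses):
    """Return the percentage of respondents who selected each option."""
    n = len(responses)
    # set() guards against counting a respondent twice for the same option.
    counts = Counter(option for picks in responses for option in set(picks))
    return {option: 100.0 * count / n for option, count in counts.items()}


shares = option_shares(responses)
total = sum(shares.values())

for option, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{option}: {pct:.0f}%")
print(f"Sum of percentages: {total:.0f}%")
```

Each percentage is computed against the number of respondents, not the number of selections, so a respondent who ticks three options contributes to three bars. The sum across options therefore has no reason to equal 100%.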
Key Results
Over half of data scientists are less than thirty years old.
Unsurprisingly, data scientists are highly educated, with well over half
obtaining advanced degrees.
More than half of respondents have fewer than five years of coding
experience and even less experience with machine learning.
Salaries for data scientists in the United States far exceed other
countries.
Nearly one in four professional data scientists have still not adopted
cloud computing.
The United States spends far more on machine learning and cloud
computing products than the rest of the world.
Gender
There’s still a significant gender gap for data
scientists, with 84% of users identifying as male.
[...] Kaggle surveys.
[Figure: GENDER. Share of respondents by gender: Male, Female, Prefer not to say, Prefer to self describe.]
Age
Millennials dominate data science, with 25-29 year
olds being the most common age bracket. In India,
the numbers skew even younger, where 41% are
19-24. However, adults of all ages are exploring data
science, with 18% of all respondents 40 or older.
[Figure: AGE. Share of respondents by age bracket: 18-21, 22-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-69, 70+.]
Country
The largest numbers of responses to the survey came
from the United States and India. Brazil and Russia
were the next-most common locations. Countries
not shown (such as many in Central Africa) had no
responses.
[Figure: COUNTRY RESPONSES. Number of responses by country.]
Education
Higher Education
The data scientist community is highly educated.
Looking at only employed data scientists, over 70%
of respondents have a degree above a bachelor’s
degree, with a majority (~52%) having a master’s
degree. While 19% of all respondents have
PhDs, the share varied greatly by country. Germany had the
highest percentage of respondents who held doctoral
degrees with 38%, while India had the lowest
percentage with under 5%.
[Figure: EDUCATION. Share of respondents by highest level of education: Master’s degree; Bachelor’s degree; Doctoral degree; Professional degree; Some college/university study without earning a bachelor’s degree; No formal education past high school; I prefer not to answer.]
[Figure: MEDIA CONSUMPTION. Share of respondents using each source: Blogs, Kaggle Forums, YouTube, Journal, Twitter, Course, Reddit, Slack, Podcasts, Hacker News, Other, None.]
Data Science & Machine Learning Experience
The global data scientist community consists of
roughly equal numbers of new learners and seasoned veterans.
The most common range (33%) is 3-5 years of
experience. Roughly one-third have less than three
years of experience and another third have more than
five years of experience.
[Figures: experience distribution charts. First chart (Global): <1, 1-2, 3-5, 5-10, 10-20, 20+ years. Second chart (Global vs. USA): <1, 1-2, 2-3, 3-4, 4-5, 5-10, 10-15, 20+ years; axis 0% to 30%.]
Employment
We asked data scientists about their salary, employer type, and how they
spend their time. Results varied considerably by country, especially
when it came to pay.
Pay
United States data scientists average higher wages than others surveyed,
followed by Germany and Japan. India, on the other hand, sees lower
salaries, with nearly 20% of Indian respondents earning less than $1,000
annually.
[Figure: SALARY. Share of respondents by annual salary range in USD, from 0-1k to >500k; axis 0% to 20%.]
Time Spent
What do users say is the most common duty of being a data scientist?
More than complex machine learning, over 75% suggested understanding and
analyzing the data is a common activity. Perhaps this explains how
Kagglers are able to create so many great EDA kernels in the first hour
of every new competition!

Machine learning does play a healthy role in the work of a data
scientist. Prototyping and experimenting with machine learning were
mentioned by more than half of respondents.
[Figure: activities that make up an important part of the role: Analyze and understand data; Build prototypes to explore applying ML; Experiment and iterate to improve existing ML models; Build and/or run a ML service operationally; Do research that advances the state of the art of ML; Other; None of these activities are important to my role.]
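[Figures: two charts whose titles were lost in extraction. Categories of the first: 0-49, 50-249, 250-999, 1,000-9,999, >10,000. Categories of the second: 1-2, 3-4, 5-9, 10-14, 15-19, 20+.]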
[Figure: use of ML methods, 2019 vs. 2018; axis 0% to 40%. Categories: We are exploring ML methods (and may one day put a model into production); We use ML methods for generating insights (but do not put working models into production); We recently started using ML methods (i.e., models in production for less than 2 years); We have well established ML methods (i.e., models in production for more than 2 years); No (we do not use ML methods); I do not know.]
Spending
Internationally, the plurality of respondents (23%) didn’t spend money on
machine learning and cloud computing products at all. That being said,
11% of respondents have spent more than $100,000.

The story is different in the United States, where a plurality (24%) have
spent over $100,000 on products in the past five years. Only 34% report
having spent less than $1,000, compared to nearly 43% globally.
[Figure: spending on machine learning and cloud computing products, Global vs. USA; axis 0% to 30%.]
[Figure: development environments used; axis 0% to 90%. Categories: Jupyter (JupyterLab, Jupyter Notebooks, etc.), RStudio, PyCharm, Visual Studio/Visual Studio Code, Notepad++, Spyder, Sublime Text, Vim/Emacs, Atom, MATLAB, Other, None.]
Methods and Algorithms
Respondents are big fans of keeping it simple.
The most common methods are linear or logistic
regression, followed by decision trees. While not as
powerful as more complex techniques, they can still
be quite effective and are easier to interpret.
[Figure: methods and algorithms used; axis 0% to 90%. Categories: Linear or Logistic Regression, Decision Trees or Random Forests, Gradient Boosting Machines, Convolutional Neural Networks, Bayesian Approaches, Dense Neural Networks, Recurrent Neural Networks, Transformer Networks, Generative Adversarial Networks, Evolutionary Approaches, Other, None.]
As for the machine learning frameworks used
to apply these techniques, data scientists use
multiple tools. Over 80% use Scikit-learn, a Python
package containing popular data science algorithms.
TensorFlow and Keras, often used in combination,
continue to be the dominant deep learning
frameworks.
[Figure: FRAMEWORKS USAGE; axis 0% to 90%. Categories: Scikit-learn, Keras, Xgboost, TensorFlow, RandomForest, LightGBM, PyTorch, Caret, Spark MLlib, Fast.ai, Other, None.]
Enterprise Tools
Most professional data scientists are making use of
cloud computing, though over 24% still are not. AWS,
Google Cloud Platform, and Microsoft Azure are by
far the top three choices among data scientists using
cloud tools.
[Figure: cloud computing platforms used: Amazon Web Services (AWS), Google Cloud Platform (GCP), None, Microsoft Azure, IBM Cloud, Other, Oracle Cloud, VMware Cloud, Salesforce Cloud, SAP Cloud, Alibaba Cloud.]
[Figure: chart title lost in extraction. Global vs. USA, 2018 and 2019; axis 0% to 8%.]
Conclusion