Kaggle's State of Data Science and Machine Learning 2019: Enterprise Executive Summary

Kaggle’s State of Machine Learning and Data
Science 2019 survey is the most comprehensive
dataset available on the state of machine learning
and data science today. This executive summary is
limited to responses from individuals who selected
their current job title as “data scientist.”
Overview
Kaggle’s third annual survey of its community shows the worldwide reach of data science. Based on responses from 19,717 Kaggle members, this report focuses on the 21% that are currently employed as data scientists. Overall, we see a relatively young, highly educated community working at companies of all sizes that are still figuring out the best way to adopt machine learning technologies.

The Kaggle community includes learners and practitioners of all levels. This analysis focuses on the professional data scientists within the community: their education, employment, and the tools used to perform their work. You’ll see certain regions, most notably the United States and India, represented at the extremes of the data.

Data scientists have adopted cloud computing in their work, though not as a replacement for local developer environments. Nevertheless, many have significant budgets for cloud tools, with the United States spending beyond others. Google Cloud Platform usage grew compared to the 2018 survey, with overall usage second to AWS. Among cloud machine learning tools, use of Google Cloud AutoML nearly doubled since last year.

While many data scientists have advanced degrees, most continue to learn new data science skills. Blogs, Kaggle forums, Coursera, and YouTube are among the common methods of ongoing education. With many companies still new to machine learning, it’s clear there is still a need for both instruction and practical application of techniques.
Report Methodology
The content of this report focuses on respondents who are currently employed and chose their current job title as “data scientist”. There are many other job titles that support data science and machine learning workflows and you can find their responses in the complete 2019 survey dataset on Kaggle.

Many survey questions were multiple choice with the ability for respondents to select all options that applied to them. For that reason, you may see visualizations where the total percentage is more than 100%. All monetary amounts captured in the report are in USD.
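As a rough illustration of why charts for select-all-that-apply questions can total more than 100%, the sketch below tallies a made-up multi-select question. The respondent answers and the `option_shares` helper are hypothetical and are not taken from the survey dataset; the only assumption about method is that each option's share is computed against the number of respondents rather than the number of selections.

```python
from collections import Counter


def option_shares(responses):
    """Return each option's share of respondents, as a percentage.

    `responses` is a list of sets, one per respondent, holding every
    option that respondent selected (hypothetical data, not survey data).
    """
    counts = Counter(option for selected in responses for option in selected)
    n_respondents = len(responses)
    return {option: 100 * count / n_respondents for option, count in counts.items()}


# Four made-up respondents answering a select-all-that-apply question.
responses = [
    {"Blogs", "Kaggle forums"},
    {"Blogs", "YouTube"},
    {"Blogs", "Kaggle forums", "YouTube"},
    {"Kaggle forums"},
]

shares = option_shares(responses)
total = sum(shares.values())
```

In this toy case four respondents make eight selections between them, so the shares add up to 200% even though no single option exceeds 100%.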
Key Results
Data science is mostly male, an imbalance that has remained unchanged from previous years.
Over half of data scientists are less than thirty years old.
Unsurprisingly, data scientists are highly educated, with well over half obtaining advanced degrees.
More than half of respondents have fewer than five years of coding experience and even less experience with machine learning.
Salaries for data scientists in the United States far exceed those in other countries.
Most data scientists work at either small or very large organizations.
More than half of companies are new to machine learning.
Local development environments are the most common way data scientists perform their work.
Nearly one in four professional data scientists have still not adopted cloud computing.
TensorFlow and Keras continue to be the dominant deep learning frameworks.
The United States spends far more on machine learning and cloud computing products than the rest of the world.
Simple methods, such as linear regression and decision trees, dominate despite the power of more complex techniques.
Usage of Google Cloud AutoML nearly doubled compared to last year.

Data Scientist Profile
Gender
There’s still a significant gender gap among data scientists, with 84% of respondents identifying as male. The United States has a slightly smaller gap at 79%, while Japan has a slightly larger one at 90%. The results are relatively consistent regardless of region and have not changed since earlier Kaggle surveys.
[Figure: GENDER. Share of respondents (0% to 100%) by gender: Male, Female, Prefer not to say, Prefer to self describe.]
Age
Millennials dominate data science, with 25-29 year
olds being the most common age bracket. In India,
the numbers skew even younger, where 41% are
19-24. However, adults of all ages are exploring data
science, with 18% of all respondents 40 or older.
[Figure: AGE. Share of respondents (0% to 30%) by age bracket: 18-21, 22-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-69, 70+.]
Country
The largest number of responses to the survey were
from the United States and India. Brazil and Russia
were the next-most common locations. Countries
not shown (such as many in Central Africa) had no
responses.
[Figure: COUNTRY RESPONSES. Number of responses by country.]
Education
Higher Education
The data scientist community is highly educated. Looking only at employed data scientists, over 70% of respondents have a degree above a bachelor’s degree, with a majority (~52%) having a master’s degree. While 19% of all respondents have PhDs, the share varied greatly by country. Germany had the highest percentage of respondents holding doctoral degrees at 38%, while India had the lowest at under 5%.
More than 99.5% of data scientists pursued some education after high school.
[Figure: EDUCATION. Share of respondents (0% to 60%) by highest level of education: Master’s degree, Bachelor’s degree, Doctoral degree, Professional degree, Some college/university study without earning a bachelor’s degree, No formal education past high school, I prefer not to answer.]

Ongoing Learning
Over 70% of data scientists said they learned
through reading blogs. Using the Kaggle forums is
also popular among Kaggle users, with over 65%
using them. There were many other responses,
but one thing is certain: the vast majority of
data scientists are still learning; only ~2% of
respondents said they don’t use any media to
improve their data science skills.
[Figure: MEDIA CONSUMPTION. Share of respondents (0% to 80%) by learning source: Blogs, Kaggle Forums, YouTube, Journal, Twitter, Course, Reddit, Slack, Podcasts, Hacker News, Other, None.]
Data Science & Machine Learning Experience
The global data scientist community consists of a roughly equal number of new learners and seasoned veterans. The most common range of coding experience (33%) is 3-5 years. Roughly one-third have less than three years of experience and another third have more than five years.
[Figure: TIME SPENT LEARNING CODE. Share of respondents (0% to 40%), Global vs. USA, by coding experience: I have never written code, <1 years, 1-2 years, 3-5 years, 5-10 years, 10-20 years, 20+ years.]

Machine learning experience is less normally distributed. While most have more than a year of experience, 35% are still in their first two years of using machine learning. About 6% have more than 10 years of machine learning experience.

Compared to the international figures, the US has more data scientists in their first two years (17%) and more with 10+ years of experience (14%).
[Figure: TIME SPENT LEARNING MACHINE LEARNING. Share of respondents (0% to 30%), Global vs. USA, by machine learning experience: <1 years, 1-2 years, 2-3 years, 3-4 years, 4-5 years, 5-10 years, 10-15 years, 20+ years.]
Employment
We asked data scientists about their salary, employer type, and how they spend their time. Results varied considerably by country, especially when it came to pay.
Pay
United States data scientists average higher wages than others surveyed, followed by Germany and Japan. India, on the other hand, sees lower salaries, with nearly 20% of Indian respondents earning less than $1,000 annually.
[Figure: GLOBAL SALARIES ($USD). Salary ranges, from 10k-15k to 100k-125k, for India, Brazil, Russia, Japan, Germany, and USA.]

Those employed as data scientists in the United States fall within ranges near the top of the scale used in our survey. The majority make between $100,000 and $200,000.
[Figure: SALARY. Share of respondents (0% to 20%) by annual salary bracket, from 0-1k to >500k.]
Time Spent
What do users say is the most common duty of being a data scientist? More than complex machine learning, over 75% suggested understanding and analyzing the data is a common activity. Perhaps this explains how Kagglers are able to create so many great EDA kernels in the first hour of every new competition!

Machine learning does play a healthy role in the work of a data scientist. Prototyping and experimenting with machine learning were mentioned by more than half of respondents.
[Figure: HOW DATA SCIENTISTS SPEND THEIR TIME. Share of respondents (0% to 80%) by activity: Analyze and understand data; Build prototypes to explore applying ML; Experiment and iterate to improve existing ML models; Build and/or run a ML service operationally; Build and/or run the data infrastructure; Do research that advances the state of the art of ML; Other; None of these activities are important to my role.]

Companies Employing Data Science
We asked data scientists more about the organizations where they worked: number of employees, team sizes, and how the companies have adopted machine learning practices.

Data scientists tend to congregate at both ends of the company size spectrum. The most common responses were from representatives of companies with fewer than 50 employees. Next came the much larger companies with more than 10,000 employees.
[Figure: COMPANY SIZE (# OF EMPLOYEES). Share of respondents (0% to 30%) by company size: 0-49, 50-249, 250-999, 1,000-9,999, >10,000.]
Data Science Teams
The size of the data science team varies, though 25% work on teams with 20 or more members. Combining the lower ranges, we see over 40% work on teams of fewer than five people.

Of users that are currently employed as a data scientist, 4% reported that they had a team size of zero. Either these respondents weren’t counting themselves, or perhaps data science is only a portion of their responsibilities.
[Figure: DATA SCIENCE TEAMS (# OF EMPLOYEES). Share of respondents (0% to 30%) by team size: 1-2, 3-4, 5-9, 10-14, 15-19, 20+.]

Enterprise Machine Learning Adoption
Matching other responses, machine learning is becoming more popular. Over 30% of users say their company has recently started using machine learning methods and 17% say they’re exploring machine learning methods. The percentage of data scientists that work for companies with well established machine learning methods increased by 11% from 2018.
[Figure: MACHINE LEARNING ADOPTION. Share of respondents (0% to 40%), 2019 vs. 2018, by adoption stage: We are exploring ML methods (and may one day put a model into production); We use ML methods for generating insights (but do not put working models into production); We recently started using ML methods (i.e., models in production for less than 2 years); We have well established ML methods (i.e., models in production for more than 2 years); No (we do not use ML methods); I do not know.]
Spending
Internationally, the plurality of respondents (23%) didn’t spend money on machine learning and cloud computing products at all. That being said, 11% of respondents have spent more than $100,000.

The story is different in the United States, where a plurality (24%) have spent over $100,000 on products in the past five years. Only 34% report having spent less than $1,000, compared to nearly 43% globally.
[Figure: ENTERPRISE SPENDING IN PAST 5 YEARS ($USD). Share of respondents (0% to 30%), Global vs. USA, by amount spent: 0, 1-100, 100-1k, 1k-10k, 10k-100k, >100k.]

Technology
Interactive Development Environments

Local development environments are by far the most common analytics tools. Of those, JupyterLab and its offshoots are the most common, with 83% of data scientists using them on a regular basis.
[Figure: POPULAR IDE USAGE. Share of respondents (0% to 90%) by tool: Jupyter (JupyterLab, Jupyter Notebooks, etc.), RStudio, PyCharm, Visual Studio/Visual Studio Code, Notepad++, Spyder, Sublime Text, Vim/Emacs, Atom, MATLAB, Other, None.]
Methods and Algorithms
Respondents are big fans of keeping it simple.
The most common methods are linear or logistic
regression, followed by decision trees. While not as
powerful as more complex techniques, they can still
be quite effective and are easier to interpret.
[Figure: METHODS AND ALGORITHMS USAGE. Share of respondents (0% to 90%) by method: Linear or Logistic Regression, Decision Trees or Random Forests, Gradient Boosting Machines, Convolutional Neural Networks, Bayesian Approaches, Dense Neural Networks, Recurrent Neural Networks, Transformer Networks, Generative Adversarial Networks, Evolutionary Approaches, Other, None.]
As for the machine learning frameworks used to apply these techniques, data scientists use multiple tools. Over 80% use Scikit-learn, a Python package containing popular data science algorithms. TensorFlow and Keras, often used in combination, continue to be the dominant deep learning frameworks.
[Figure: FRAMEWORKS USAGE. Share of respondents (0% to 90%) by framework: Scikit-learn, Keras, Xgboost, TensorFlow, RandomForest, LightGBM, PyTorch, Caret, Spark MLlib, Fast.ai, Other, None.]
Enterprise Tools
Most professional data scientists are making use of
cloud computing, though over 24% still are not. AWS,
Google Cloud Platform, and Microsoft Azure are by
far the top three choices among data scientists using
cloud tools.
[Figure: ENTERPRISE TOOLS USAGE. Share of respondents (0% to 50%) by cloud platform: Amazon Web Services (AWS), Google Cloud Platform (GCP), None, Microsoft Azure, IBM Cloud, Other, Oracle Cloud, VMware Cloud, Red Hat Cloud, Salesforce Cloud, SAP Cloud, Alibaba Cloud.]

Automated Machine Learning
Particularly notable is the growth of Google Cloud
AutoML since last year’s survey. The number of
respondents using this machine learning platform
nearly doubled overall, with a similar rate of growth
among US-based data scientists.
[Figure: GOOGLE CLOUD AUTOML. Share of respondents (0% to 8%) using Google Cloud AutoML, Global vs. USA, 2018 and 2019.]
Conclusion
This 2019 edition of the State of Machine Learning and Data Science includes insights gathered from a survey of 19,717 Kaggle members. Their answers covered demographics, education, employment, and technology usage.

Many of the charts and results are drawn from professional data scientists (covering 21% of respondents). There’s even more to uncover in the most comprehensive dataset available on the state of machine learning and data science today.

Kaggle has published the complete dataset of responses for the community to review, and we’ll run a competition from November 11 to December 2 to learn even more about data science practitioners in 2019.
