Kaggle's State of Data Science and Machine Learning 2019: Enterprise Executive Summary


Kaggle’s State of Machine Learning and Data
Science 2019 survey is the most comprehensive
dataset available on the state of machine learning
and data science today. This executive summary is
limited to responses from individuals who selected
their current job title as “data scientist.”
Overview

Kaggle’s third annual survey of its community
shows the worldwide reach of data science. Based
on responses from 19,717 Kaggle members, this
report focuses on the 21% that are currently
employed as data scientists. Overall, we see a
relatively young, highly educated community working
at companies of all sizes that are still figuring out the
best way to adopt machine learning technologies.

The Kaggle community includes learners and
practitioners of all levels. This analysis focuses on the
professional data scientists within the community:
their education, employment, and the tools used to
perform their work. You’ll see certain regions (most
notably the United States and India) represented at
the extremes of the data.

Data scientists have adopted cloud computing in
their work, though not as a replacement for local
developer environments. Nevertheless, many have
significant budgets for cloud tools, with the United
States spending beyond others. Google Cloud
Platform usage grew compared to the 2018 survey,
with overall usage second to AWS. Among cloud
machine learning tools, use of Google Cloud AutoML
nearly doubled since last year.

While many data scientists have advanced degrees,
most continue to learn new data science skills. Blogs,
Kaggle forums, Coursera, and YouTube are among
the common methods of ongoing education. With
many companies still new to machine learning, it’s
clear there is still a need for both instruction and
practical application of techniques.
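
As a rough check on the scale of the subset this report covers, the 21% share can be turned into an approximate head count. This is only a sketch: the report gives the share as a percentage, presumably rounded, so the exact number of data scientist respondents is not stated here, and the variable names below are illustrative.

```python
import math

TOTAL_RESPONDENTS = 19_717   # stated in the report
DATA_SCIENTIST_SHARE = 0.21  # stated in the report, presumably rounded

# Point estimate of how many respondents are employed data scientists.
estimate = round(TOTAL_RESPONDENTS * DATA_SCIENTIST_SHARE)

# Assumption: if 21% was rounded to the nearest whole percent, the true share
# lies between 20.5% and 21.5%, which bounds the head count.
low = math.ceil(TOTAL_RESPONDENTS * 0.205)
high = math.floor(TOTAL_RESPONDENTS * 0.215)

print(estimate, low, high)
```

Under these assumptions the subset is on the order of 4,100 respondents.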

Report Methodology

The content of this report focuses on respondents
who are currently employed and chose their current
job title as “data scientist”. There are many other job
titles that support data science and machine learning
workflows and you can find their responses in the
complete 2019 survey dataset on Kaggle.

Many survey questions were multiple choice with
the ability for respondents to select all options
that applied to them. For that reason, you may see
visualizations where the total percentage is more
than 100%. All monetary amounts captured in the
report are in USD.
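
The two points in this methodology (filtering to one job title, and percentages that can total more than 100%) can be sketched in a few lines of Python. The rows, field names (`job_title`, `media`), and values below are invented for illustration and do not come from the survey dataset; the sketch only shows why select-all-that-apply questions produce shares that sum past 100%.

```python
from collections import Counter

# Hypothetical toy rows; the field names and values are invented for
# illustration and are not taken from the survey dataset.
rows = [
    {"job_title": "Data Scientist", "media": {"Blogs", "Kaggle forums"}},
    {"job_title": "Data Scientist", "media": {"Blogs", "YouTube"}},
    {"job_title": "Student", "media": {"YouTube"}},
    {"job_title": "Data Scientist", "media": {"Blogs", "Kaggle forums", "YouTube"}},
    {"job_title": "Data Scientist", "media": {"Kaggle forums"}},
]


def option_shares(rows, job_title="Data Scientist"):
    """Percentage of respondents with the given title who chose each option."""
    selected = [row["media"] for row in rows if row["job_title"] == job_title]
    counts = Counter(option for choices in selected for option in choices)
    return {option: 100.0 * n / len(selected) for option, n in counts.items()}


shares = option_shares(rows)
total_share = sum(shares.values())
print(shares)
print(total_share)
```

Because each respondent can contribute to several options, every share is computed against the same respondent count, and the shares are not expected to add up to 100%.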
Key Results

Data science is mostly male, an imbalance that has remained unchanged
from previous years.

Over half of data scientists are less than thirty years old.

Unsurprisingly, data scientists are highly educated, with well over half
obtaining advanced degrees.

More than half of respondents have fewer than five years of coding
experience and even less experience with machine learning.

Salaries for data scientists in the United States far exceed other
countries.

Most data scientists work at either small or very large organizations.

More than half of companies are new to machine learning.

Local development environments are the most common way data
scientists perform their work.

Nearly one in four professional data scientists have still not adopted
cloud computing.

TensorFlow and Keras continue to be the dominant deep learning
frameworks.

The United States spends far more on machine learning and cloud
computing products than the rest of the world.

Simple methods, such as linear regressions and decision trees, dominate
despite the power of more complex techniques.

Usage of Google Cloud AutoML nearly doubled compared to last year.


Data Scientist Profile

Gender
There’s still a significant gender gap for data
scientists, with 84% of users identifying as male.
The United States has a slightly smaller gender gap
at 79%, while Japan has a slightly larger one at 90%.
The results are relatively consistent regardless of
region and have not changed since earlier
Kaggle surveys.

[Chart: GENDER. Share of respondents (0% to 100%) by category: Male, Female, Prefer not to say, Prefer to self-describe.]

Age
Millennials dominate data science, with 25-29 year
olds being the most common age bracket. In India,
the numbers skew even younger, where 41% are
19-24. However, adults of all ages are exploring data
science, with 18% of all respondents 40 or older.

[Chart: AGE. Share of respondents by age bracket: 18-21, 22-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-69, 70+.]

Country
The largest number of responses to the survey came
from the United States and India. Brazil and Russia
were the next-most common locations. Countries
not shown (such as many in Central Africa) had no
responses.

[Chart: COUNTRY RESPONSES. Number of responses by country.]

Education

Higher Education
The data scientist community is highly educated.
Looking at only employed data scientists, over 70%
of respondents have a degree above a bachelor’s
degree, with a majority (~52%) having a master’s
degree. While 19% of the total respondents have
PhDs, the share varied greatly by country. Germany
had the highest percentage of respondents who held
doctoral degrees with 38%, while India had the lowest
percentage with under 5%.

More than 99.5% of data scientists pursued some
education after high school.

[Chart: EDUCATION. Share of respondents (0% to 60%) by highest level of education: Master’s degree; Bachelor’s degree; Doctoral degree; Professional degree; Some college/university study without earning a bachelor’s degree; No formal education past high school; I prefer not to answer.]


Ongoing Learning
Over 70% of data scientists said they learned
through reading blogs. Using the Kaggle forums is
also popular among Kaggle users, with over 65%
using them. There were many other responses,
but one thing is certain: the vast majority of
data scientists are still learning; only ~2% of
respondents said they don’t use any media to
improve their data science skills.

[Chart: MEDIA CONSUMPTION. Share of respondents (0% to 80%) by source: Blogs, Kaggle Forums, YouTube, Journal, Twitter, Course, Reddit, Slack, Podcasts, Hacker News, Other, None.]

Data Science & Machine Learning Experience
The global data scientist community consists of
roughly equal numbers of new learners and seasoned
veterans. The most common range (33%) is 3-5 years
of experience. Roughly one-third have less than three
years of experience and another third have more than
five years of experience.

[Chart: TIME SPENT LEARNING CODE. Share of respondents (0% to 40%), Global vs. USA, by coding experience: I have never written code, <1 years, 1-2 years, 3-5 years, 5-10 years, 10-20 years, 20+ years.]


Machine learning is less normally distributed. While
most have more than a year of experience, 35%
are still in their first two years of using machine
learning. About 6% have more than 10 years of
machine learning experience.

Compared to the international figures, the US has
more data scientists in their first two years (17%) and
more with 10+ years of experience (14%).

[Chart: TIME SPENT LEARNING MACHINE LEARNING. Share of respondents (0% to 30%), Global vs. USA, by machine learning experience: <1 years, 1-2 years, 2-3 years, 3-4 years, 4-5 years, 5-10 years, 10-15 years, 20+ years.]

Employment

Pay
We asked data scientists about their salary,
employer type, and how they spend their time.
Results varied considerably by country, especially
when it came to pay.

United States data scientists average higher
wages than others surveyed, followed by Germany
and Japan. India, on the other hand, sees lower
salaries, with nearly 20% of Indian respondents
earning less than $1,000 annually.

[Chart: GLOBAL SALARIES ($USD). Salary range (from 10k-15k up to 100k-125k) by country: India, Brazil, Russia, Japan, Germany, USA.]


Those employed as data scientists in the United
States fall within ranges near the top of the scale
used in our survey. The majority make between
$100,000 and $200,000.

[Chart: SALARY. Share of respondents (0% to 20%) by salary range in USD, from 0-1k to >500k.]

Time Spent
What do users say is the most common duty of
being a data scientist? More than complex machine
learning, over 75% suggested understanding and
analyzing the data is a common activity. Perhaps
this explains how Kagglers are able to create so
many great EDA kernels in the first hour of every new
competition!

Machine learning does play a healthy role in the work
of a data scientist. Prototyping and experimenting
with machine learning were mentioned by more
than half of respondents.

[Chart: HOW DATA SCIENTISTS SPEND THEIR TIME. Share of respondents (0% to 80%) by activity: Analyze and understand data; Build prototypes to explore applying ML; Experiment and iterate to improve existing ML models; Build and/or run a ML service operationally; Build and/or run the data infrastructure; Do research that advances the state of the art of ML; Other; None of these activities are important to my role.]


Companies Employing Data Science
We asked data scientists more about the
organizations where they worked: number of
employees, team sizes, and how the companies have
adopted machine learning practices.

Data scientists tend to congregate at
both ends of the company size spectrum.
The most common responses were from
representatives of companies with fewer than
50 employees. Next came the much larger
companies with more than 10,000 employees.

[Chart: COMPANY SIZE (# OF EMPLOYEES). Share of respondents (0% to 30%) by company size: 0-49, 50-249, 250-999, 1,000-9,999, >10,000.]

Data Science Teams
The size of the data science team varies, though
25% work on teams with 20 or more members.
Combining the lower ranges, we see over 40% work
on teams of fewer than five people.

Of users that are currently employed as a data
scientist, 4% reported that they had a team size of
zero. Either these respondents weren’t counting
themselves, or perhaps data science is only a
portion of their responsibilities.

[Chart: DATA SCIENCE TEAMS (# OF EMPLOYEES). Share of respondents (0% to 30%) by team size: 1-2, 3-4, 5-9, 10-14, 15-19, 20+.]


Enterprise Machine Learning Adoption
Matching other responses, machine learning is
becoming more popular. Over 30% of users say
their company has recently started using machine
learning methods and 17% say they’re exploring
machine learning methods. The percentage of data
scientists that work for companies with well-established
machine learning methods increased by 11%
from 2018.

[Chart: MACHINE LEARNING ADOPTION. Share of respondents (0% to 40%), 2019 vs. 2018, by response: We are exploring ML methods (and may one day put a model into production); We use ML methods for generating insights (but do not put working models into production); We recently started using ML methods (i.e., models in production for less than 2 years); We have well established ML methods (i.e., models in production for more than 2 years); No (we do not use ML methods); I do not know.]

Spending
Internationally, the plurality of respondents (23%)
didn’t spend money on machine learning and
cloud computing products at all. That being said,
11% of respondents have spent more than $100,000.

The story is different in the United States, where
a plurality (24%) have spent over $100,000 on
products in the past five years. Only 34% report
having spent less than $1,000, compared to nearly
43% globally.

[Chart: ENTERPRISE SPENDING IN PAST 5 YEARS ($USD). Share of respondents (0% to 30%), Global vs. USA, by amount spent: 0, 1-100, 100-1k, 1k-10k, 10k-100k, >100k.]


Technology

Interactive Development Environments


The most common analytics tools are by far local
development environments. Of those, JupyterLab
and its offshoots are the most common, with
83% of data scientists using them on a regular basis.

[Chart: POPULAR IDE USAGE. Share of respondents (0% to 90%) by tool: Jupyter (JupyterLab, Jupyter Notebooks, etc.), RStudio, PyCharm, Visual Studio/Visual Studio Code, Notepad++, Spyder, Sublime Text, Vim/Emacs, Atom, MATLAB, Other, None.]

Methods and Algorithms
Respondents are big fans of keeping it simple.
The most common methods are linear or logistic
regression, followed by decision trees. While not as
powerful as more complex techniques, they can still
be quite effective and are easier to interpret.
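
To illustrate why such simple methods are considered easier to interpret, here is a minimal sketch of a one-feature linear regression fitted by ordinary least squares in plain Python. The data points and the function name `fit_line` are hypothetical and not taken from the survey; the point is only that the fitted model reduces to two numbers that can be read directly.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

The slope says how much the prediction changes per unit of the input, which is the kind of direct reading that more complex models generally do not offer.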

[Chart: METHODS AND ALGORITHMS USAGE. Share of respondents (0% to 90%) by method: Linear or Logistic Regression, Decision Trees or Random Forests, Gradient Boosting Machines, Convolutional Neural Networks, Bayesian Approaches, Dense Neural Networks, Recurrent Neural Networks, Transformer Networks, Generative Adversarial Networks, Evolutionary Approaches, Other, None.]

As for the machine learning frameworks used
to employ their techniques, data scientists use
multiple tools. Over 80% use Scikit-learn, a Python
package containing popular data science algorithms.
TensorFlow and Keras, often used in combination,
continue to be the dominant deep learning
frameworks.

[Chart: FRAMEWORKS USAGE. Share of respondents (0% to 90%) by framework: Scikit-learn, Keras, Xgboost, TensorFlow, RandomForest, LightGBM, PyTorch, Caret, Spark MLlib, Fast.ai, Other, None.]

Enterprise Tools
Most professional data scientists are making use of
cloud computing, though over 24% still are not. AWS,
Google Cloud Platform, and Microsoft Azure are by
far the top three choices among data scientists using
cloud tools.

[Chart: ENTERPRISE TOOLS USAGE. Share of respondents (0% to 50%) by cloud platform: Amazon Web Services (AWS), Google Cloud Platform (GCP), None, Microsoft Azure, IBM Cloud, Other, Oracle Cloud, VMware Cloud, Red Hat Cloud, Salesforce Cloud, SAP Cloud, Alibaba Cloud.]


Automated Machine Learning
Particularly notable is the growth of Google Cloud
AutoML since last year’s survey. The number of
respondents using this machine learning platform
nearly doubled overall, with a similar rate of growth
among US-based data scientists.

[Chart: GOOGLE CLOUD AUTOML. Share of respondents (0% to 8%), Global vs. USA, 2018 and 2019.]

Conclusion

This 2019 edition of the State of Machine Learning
and Data Science includes insights gathered from
a survey of 19,717 Kaggle members. Their answers
covered demographics, education, employment, and
technology usage.

Many of the charts and results are drawn from
professional data scientists (covering 21% of
respondents). There’s even more to uncover in the
most comprehensive dataset available on the state of
machine learning and data science today.

Kaggle has published the complete dataset of
responses for the community to review, and we’ll run
a competition from November 11 to December 2 to
learn even more about data science practitioners in
2019.
