Select columns in PySpark dataframe

Question

I am looking for a way to select columns of my dataframe in PySpark. For the first row, I know I can use df.first(), but not sure about columns given that they do not have column names.

I have 5 columns and want to loop through each one of them.

+--+---+---+---+---+---+---+
|_1| _2| _3| _4| _5| _6| _7|
+--+---+---+---+---+---+---+
|1 |0.0|0.0|0.0|1.0|0.0|0.0|
|2 |1.0|0.0|0.0|0.0|0.0|0.0|
|3 |0.0|0.0|1.0|0.0|0.0|0.0|

MaxU - stand with Ukraine · Accepted Answer · 2017-10-18 15:16:44Z

77

Try something like this:

df.select([c for c in df.columns if c in ['_2','_4','_5']]).show()

edited Oct 18, 2017 at 15:16

answered Oct 18, 2017 at 15:14

MaxU - stand with Ukraine

211k37 gold badges401 silver badges432 bronze badges

I do not want to hard code because I would have to do this for hundreds of columns. So I want to loop through the columns do some analysis.
– Nivi
Commented Oct 18, 2017 at 15:16
@Nivi, i've updated my answer - is that what you want?
– MaxU - stand with Ukraine
Commented Oct 18, 2017 at 15:17
Ah! that is a straight fwd way that I have been using for so long. I just went blank now. Thanks MAx :)
– Nivi
Commented Oct 18, 2017 at 15:18
1

@rishijain, stackoverflow.com/posts/46813599/revisions ;-) reason
– MaxU - stand with Ukraine
Commented Jun 10, 2019 at 3:27
1

this helped me.
– abdoulsn
Commented Oct 29, 2019 at 22:53

Add a comment |

Romain · Accepted Answer · 2018-08-30 15:27:34Z

36

First two columns and 5 rows

 df.select(df.columns[:2]).take(5)

edited Aug 30, 2018 at 15:27

Romain

21.8k6 gold badges62 silver badges75 bronze badges

answered Mar 29, 2018 at 21:14

Michael West

1,7061 gold badge16 silver badges24 bronze badges

Add a comment |

Shadowtrooper · Accepted Answer · 2019-08-14 12:41:52Z

27

You can use an array and unpack it inside the select:

cols = ['_2','_4','_5']
df.select(*cols).show()

answered Aug 14, 2019 at 12:41

Shadowtrooper

1,45416 silver badges31 bronze badges

1

This solution worked for my problem, but what does the * operator do?
– yeliabsalohcin
Commented Jan 3, 2020 at 13:45
3

@yeliabsalohcin * operator is to un pack the array. Pysparks select function does not accept an array.
– Shadowtrooper
Commented Jan 3, 2020 at 13:50
3

be careful if you have special characters like '.' in your column names, then you have surround each string with backticks '`'
– Jas
Commented May 9, 2020 at 0:26
Can you help me implement this? Select multiple column names that contain '.' ?
– charlie_boy
Commented Feb 23, 2021 at 2:31
Yes @charlie_boy , for this case, you can filter the column names using list comprehension: cols = [x for x in columns if "." in x]. Here, columns is a list with your column names. Also, be carefull with "." character in your column names, it have to be with backticks.
– Shadowtrooper
Commented Feb 23, 2021 at 7:59

Add a comment |

Mykola Zotko · Accepted Answer · 2020-12-28 17:11:03Z

The method select accepts a list of column names (string) or expressions (Column) as a parameter. To select columns you can use:

-- column names (strings):

df.select('col_1','col_2','col_3')

-- column objects:

import pyspark.sql.functions as F

df.select(F.col('col_1'), F.col('col_2'), F.col('col_3'))

# or

df.select(df.col_1, df.col_2, df.col_3)

# or

df.select(df['col_1'], df['col_2'], df['col_3'])

-- a list of column names or column objects:

df.select(*['col_1','col_2','col_3'])

#or

df.select(*[F.col('col_1'), F.col('col_2'), F.col('col_3')])

#or 

df.select(*[df.col_1, df.col_2, df.col_3])

The star operator * can be omitted as it's used to keep it consistent with other functions like drop that don't accept a list as a parameter.

desertnaut · Accepted Answer · 2017-10-18 15:20:15Z

5

Use df.schema.names:

spark.version
# u'2.2.0'

df = spark.createDataFrame([("foo", 1), ("bar", 2)])
df.show()
# +---+---+ 
# | _1| _2|
# +---+---+
# |foo|  1| 
# |bar|  2|
# +---+---+

df.schema.names
# ['_1', '_2']

for i in df.schema.names:
  # df_new = df.withColumn(i, [do-something])
  print i
# _1
# _2

answered Oct 18, 2017 at 15:20

desertnaut

60.2k31 gold badges151 silver badges176 bronze badges

Add a comment |

IOz · Accepted Answer · 2019-04-09 04:47:43Z

The dataset in ss.csv contains some columns I am interested in:

ss_ = spark.read.csv("ss.csv", header= True, 
                      inferSchema = True)
ss_.columns

['Reporting Area', 'MMWR Year', 'MMWR Week', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Current week', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Current week, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Med', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Med, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Max', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Previous 52 weeks Max, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2018', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2018, flag', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2017', 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Cum 2017, flag', 'Shiga toxin-producing Escherichia coli, Current week', 'Shiga toxin-producing Escherichia coli, Current week, flag', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Med', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Med, flag', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Max', 'Shiga toxin-producing Escherichia coli, Previous 52 weeks Max, flag', 'Shiga toxin-producing Escherichia coli, Cum 2018', 'Shiga toxin-producing Escherichia coli, Cum 2018, flag', 'Shiga toxin-producing Escherichia coli, Cum 2017', 'Shiga toxin-producing Escherichia coli, Cum 2017, flag', 'Shigellosis, Current week', 'Shigellosis, Current week, flag', 'Shigellosis, Previous 52 weeks Med', 'Shigellosis, Previous 52 weeks Med, flag', 'Shigellosis, Previous 52 weeks Max', 'Shigellosis, Previous 52 weeks Max, flag', 'Shigellosis, Cum 2018', 'Shigellosis, Cum 2018, flag', 'Shigellosis, Cum 2017', 'Shigellosis, Cum 2017, flag']

but I only need a few:

columns_lambda = lambda k: k.endswith(', Current week') or k == 'Reporting Area' or k == 'MMWR Year' or  k == 'MMWR Week'

The filter returns the list of desired columns, list is evaluated:

sss = filter(columns_lambda, ss_.columns)
to_keep = list(sss)

the list of desired columns is unpacked as arguments to dataframe select function that return dataset containing only columns in the list:

dfss = ss_.select(*to_keep)
dfss.columns

The result:

['Reporting Area',
 'MMWR Year',
 'MMWR Week',
 'Salmonellosis (excluding Paratyphoid fever andTyphoid fever)†, Current week',
 'Shiga toxin-producing Escherichia coli, Current week',
 'Shigellosis, Current week']

The df.select() has a complementary pair: http://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop

to drop the list of columns.

Collectives™ on Stack Overflow

Select columns in PySpark dataframe

6 Answers 6

Your Answer

Not the answer you're looking for? Browse other questions tagged
python
apache-spark
pyspark
apache-spark-sql
or ask your own question.

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged pythonapache-sparkpysparkapache-spark-sql or ask your own question.

Linked

Related

Not the answer you're looking for? Browse other questions tagged
python
apache-spark
pyspark
apache-spark-sql
or ask your own question.