Pandas Viva Questions


PANDAS VIVA + THEORY REVISION

1) What is pandas (Python pandas)?

pandas is an open-source Python library that provides high-performance data structures and
tools for data manipulation and analysis. The name is derived from "panel data", an
econometrics term for multidimensional data sets. Wes McKinney began developing it in 2008.
It supports the five typical steps of data processing and analysis, irrespective of the
origin of the data: load, manipulate, prepare, model, and analyze.

2) Mention the different types of Data Structures in Pandas?

pandas provides two main data structures: Series and DataFrame. Both of these data
structures are built on top of NumPy.

3) Define Series in Pandas?

A Series is a one-dimensional labeled array that can hold values of any data type.
The row labels of a Series are called the index. The pd.Series() constructor converts a
list, tuple, or dictionary into a Series. A Series cannot contain multiple columns.
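
As a minimal sketch of the constructor described above, the snippet below builds a Series from a list and from a tuple. The variable names and values are made up for illustration.

```python
import pandas as pd

# From a list: pandas assigns the default integer labels 0, 1, 2
from_list = pd.Series([10, 20, 30])

# From a tuple, with explicit row labels passed through index=
from_tuple = pd.Series((1.5, 2.5), index=["a", "b"])

print(from_list.index.tolist())
print(from_tuple["b"])
```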

4) Define DataFrame in Pandas?

A DataFrame is the most widely used pandas data structure: a two-dimensional table with
labeled axes (rows and columns). It is the standard way to store tabular data and has two
indexes, a row index and a column index. It has the following properties:

The columns can be of different types, such as int and bool.

It can be seen as a dictionary of Series in which both the rows and the columns are
indexed. The column labels are exposed as "columns" and the row labels as "index".
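
The properties above can be checked on a small example. This is only an illustrative sketch; the column names, row labels, and values are assumptions.

```python
import pandas as pd

# Two columns of different types, with explicit row labels
df = pd.DataFrame(
    {"count": [1, 2, 3], "flag": [True, False, True]},
    index=["r1", "r2", "r3"],
)

print(df.dtypes)              # one dtype per column
print(df.index.tolist())      # row labels
print(df.columns.tolist())    # column labels

# Selecting one column gives back a Series, as in a dictionary of Series
col = df["count"]
```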

5) What are the significant features of the pandas library?

The key features of the pandas library are as follows:

• Memory Efficient
• Data Alignment
• Reshaping
• Merge and join
• Time Series

6. Explain reindexing in pandas?

Reindexing conforms a DataFrame to a new index, with optional filling logic. It places
NA/NaN at the labels that were not present in the previous index. It returns a new object
unless the new index is equivalent to the current one and copy=False is passed. It can be
used to change the labels of both the rows and the columns of a DataFrame.

7. How will you create a Series from a dict in pandas?

A Series is a one-dimensional labeled array that can hold values of any data type.

We can create a pandas Series from a dictionary:

8. Create a Series from dict:

If a dictionary is passed as input and no index is specified, the dictionary keys become the
index. Recent pandas versions keep the keys in insertion order; very old versions sorted
them.

If an index is passed, the values corresponding to the labels in the index are extracted
from the dictionary.

import pandas as pd
info = {'x': 0., 'y': 1., 'z': 2.}
a = pd.Series(info)
print(a)
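
The case where an index is passed is not shown above. A small sketch, reusing the same dictionary and adding a made-up label 'w' that is not a key, could look like this:

```python
import pandas as pd

info = {"x": 0.0, "y": 1.0, "z": 2.0}

# Only the labels listed in index are taken from the dict;
# 'w' (an illustrative label) has no matching key, so it becomes NaN
b = pd.Series(info, index=["y", "z", "w"])
print(b)
```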

9. How can we create a copy of the series in Pandas?

We can create a copy of a Series with the copy() method:

Series.copy(deep=True)

With deep=True (the default), the call makes a deep copy that includes a copy of the data
and the indices. With deep=False, neither the indices nor the data are copied; the new
object only references the originals.
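
A short sketch of the deep-copy behaviour, with made-up values: changing the original after the copy has been taken does not affect the copy. The behaviour of a shallow copy after a write depends on the pandas version and its copy-on-write setting, so it is not asserted here.

```python
import pandas as pd

original = pd.Series([1, 2, 3], index=["a", "b", "c"])

deep = original.copy()     # deep=True is the default
original.iloc[0] = 99      # modify the original afterwards

print(original["a"])       # the original now holds the new value
print(deep["a"])           # the deep copy still holds the old value
```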

10. How will you create an empty DataFrame in Pandas?

A DataFrame is a two-dimensional pandas data structure with labeled axes (rows and
columns); it has a row index and a column index.

Create an empty DataFrame:

The code below shows how to create an empty DataFrame in pandas:

# importing the pandas library
import pandas as pd
info = pd.DataFrame()
print(info)

Output:

Empty DataFrame
Columns: []
Index: []

11. How will you add a column to a pandas DataFrame?

We can add a new column to an existing DataFrame by assigning to a new column label. The
code below demonstrates two ways of doing so:

# importing the pandas library
import pandas as pd
info = {'one': pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
        'two': pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}

info = pd.DataFrame(info)

# Add a new column to an existing DataFrame object
print("Add new column by passing series")
info['three'] = pd.Series([20, 40, 60], index=['a', 'b', 'c'])
print(info)
print("Add new column using existing DataFrame columns")
info['four'] = info['one'] + info['three']
print(info)

12. How to delete indices, rows, or columns from a pandas DataFrame?

Deleting an Index from Your DataFrame

To remove the index from a DataFrame, you can do one of the following:

Reset the index of the DataFrame with reset_index().

Remove the index name by setting df.index.name = None (del df.index.name, seen in older
tutorials, may fail on recent pandas versions).

Remove duplicate index values by resetting the index, dropping the duplicate values from the
resulting index column, and setting that column as the index again.

Remove an index label together with its row by using drop().

Deleting a Column from Your DataFrame

You can use the drop() method to delete a column from the DataFrame.

The axis argument passed to drop() is 0 to drop rows and 1 to drop columns.

You can pass inplace=True to delete the column without reassigning the DataFrame.

You can also remove duplicate values with the drop_duplicates() method.

Removing a Row from Your DataFrame

By using df.drop_duplicates(), we can remove duplicate rows from the DataFrame.

You can use the drop() method and specify the index labels of the rows to remove from
the DataFrame.
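
The operations listed above might be sketched as follows. The DataFrame, its labels, and its values are invented for the example; each call returns a new object and leaves df unchanged because inplace is not used.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b", "b", "c"], "score": [1, 2, 2, 3], "extra": [0, 0, 0, 0]},
    index=["r1", "r2", "r3", "r4"],
)

no_extra = df.drop("extra", axis=1)    # axis=1: drop a column
no_r1 = df.drop("r1", axis=0)          # axis=0: drop a row by its label
unique_rows = df.drop_duplicates()     # row r3 repeats row r2 and is removed
flat = df.reset_index(drop=True)       # discard the labels, back to 0..n-1
```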

13. How to Rename the Index or Columns of a Pandas DataFrame?

You can use the rename() method to give new labels to the columns or the index of a
DataFrame.

14. What is a NumPy array in pandas?

Numerical Python (NumPy) is a Python package for numerical computation and for processing
single-dimensional and multidimensional arrays. Calculations on NumPy arrays are generally
faster than the same calculations on plain Python lists. The pandas data structures are
built on top of NumPy arrays.

15. Define GroupBy in pandas?

In pandas, the groupby() function splits the data into groups according to some criteria,
so that an operation such as an aggregation can then be applied to each group. Objects can
be split along either of their axes.

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True,
                  group_keys=True, squeeze=False, **kwargs)

(This is the signature from older pandas releases; the squeeze parameter was removed in
pandas 2.0.)
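
A minimal sketch of splitting into groups and aggregating each group; the column names and figures are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Mumbai", "Chennai"],
    "amount": [100, 200, 50, 25, 10],
})

# Split the rows by city, then sum the amounts within each group
totals = sales.groupby("city")["amount"].sum()
print(totals)
```

With the default sort=True, the group keys appear in sorted order in the result.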

16. How to rename the index or columns of a pandas DataFrame?

Ans: You can use the rename() method to give new labels to the columns or the index of a
DataFrame.
The following ways of changing the index / column names (labels) of a pandas.DataFrame are
available:
• Use pandas.DataFrame.rename()
  • Change any index / column names individually with a dict
  • Change all index / column names with a function
• Use pandas.DataFrame.add_prefix(), pandas.DataFrame.add_suffix()
  • Add a prefix or suffix to the column names
• Update the index / columns attributes of pandas.DataFrame
  • Replace all index / column names

The set_index() method, which sets an existing column as the index, is also provided.

To use rename(), pass a dict of the form {original name: new name} to its index and/or
columns argument. index is for the row labels and columns is for the column names. If you
want to change only one of them, specify only that argument.
A new DataFrame is returned; the original DataFrame is not changed.
df_new = df.rename(columns={'A': 'a'}, index={'ONE': 'one'})
print(df_new)
#         a   B   C
# one    11  12  13
# TWO    21  22  23
# THREE  31  32  33

print(df)
#         A   B   C
# ONE    11  12  13
# TWO    21  22  23
# THREE  31  32  33
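
The other approaches in the list (a function, a prefix, and assigning the columns attribute) might look like the sketch below. The small DataFrame and the new names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"A": [11, 21], "B": [12, 22]}, index=["ONE", "TWO"])

# Change all names with a function
lowered = df.rename(columns=str.lower, index=str.title)

# Add a prefix to every column name
prefixed = df.add_prefix("col_")

# Replace all column names by assigning to the columns attribute of a copy
replaced = df.copy()
replaced.columns = ["x", "y"]
```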

18. How to iterate over a pandas DataFrame?

Ans: You can iterate over the rows of a DataFrame by using a for loop in combination with
an iterrows() call on the DataFrame.
import pandas as pd

df = pd.DataFrame([{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}])

# iterrows() yields (index label, row as a Series) pairs
for index, row in df.iterrows():
    print(row['c1'], row['c2'])

Output:
10 100
11 110
12 120

19. Define reindexing?

Ans: Reindexing changes the row labels and column labels of a DataFrame.
To reindex means to conform the data to match a given set of labels along a particular axis.
Multiple operations can be accomplished through reindexing, such as:
• Reorder the existing data to match a new set of labels.
• Insert missing value (NA) markers in label locations where no data for the label existed.
Example
import pandas as pd
import numpy as np
N = 20

df = pd.DataFrame({
    'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
    'x': np.linspace(0, stop=N-1, num=N),
    'y': np.random.rand(N),
    'C': np.random.choice(['Low', 'Medium', 'High'], N).tolist(),
    'D': np.random.normal(100, 10, size=(N)).tolist()
})

# reindex the DataFrame; column 'B' does not exist, so it is filled with NaN
df_reindexed = df.reindex(index=[0, 2, 5], columns=['A', 'C', 'B'])

print(df_reindexed)
Its output is as follows (column C is random, so its values vary between runs):
           A     C   B
0 2016-01-01   Low NaN
2 2016-01-03  High NaN
5 2016-01-06   Low NaN

20. How to set the index?

Python is well suited to data analysis, largely because of its ecosystem of data-centric
packages. pandas is one of those packages and makes importing and analyzing data much
easier.
set_index() is a DataFrame method that sets one or more existing columns (or arrays such as
a list or Series of matching length) as the index of the DataFrame. The index column can
also be set while creating a DataFrame, but when a DataFrame is built from two or more
DataFrames, the index can be changed afterwards with this method.

Changing Index column

In this example, the First Name column is made the index column of the DataFrame.
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# setting first name as index column
data.set_index("First Name", inplace=True)

# display
data.head()

21. Describe data operations in pandas?

Ans: pandas offers several useful data operations for a DataFrame, including the following:
• Row and column selection
We can select any row or column of the DataFrame by passing its label. A single selected
row or column is one-dimensional and is returned as a Series.
• Filter data
We can filter the data by applying boolean expressions to the DataFrame.
• Null values
A null value occurs when no data is provided for an item. Missing values in a column are
usually represented as NaN.
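
The three operations above can be sketched on a small DataFrame. The names, labels, and marks are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["Tom", "Jack", "Nick"], "marks": [99.0, np.nan, 95.0]},
    index=["s1", "s2", "s3"],
)

column = df["marks"]            # column selection returns a Series
row = df.loc["s1"]              # row selection by label returns a Series
high = df[df["marks"] > 96]     # filter rows with a boolean expression
missing = df["marks"].isna()    # True where the value is NaN
```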

22. How will you get the number of rows and columns of a DataFrame in pandas?
# get the row and column count of df as a (rows, columns) tuple
df.shape
(4, 4)

23. TRANSPOSE
>>> df
   t1  t2  t3
0   T   T   T
1   C   G   G
2   C   C   -
3   A   A   A
4   A   A   A

In this example, characters are stored as rows and sequences as columns in the DataFrame.
If you want rows to hold sequences, just transpose the DataFrame (df.T is a shorthand):

>>> df.transpose()
    0  1  2  3  4
t1  T  C  C  A  A
t2  T  G  C  A  A
t3  T  G  -  A  A

24. How can we select a column in a pandas DataFrame?

# select a single column (returns a Series)
df['Name']

# select two columns (returns a DataFrame)
df[['Name', 'Qualification']]

# select all rows and the second to fourth columns
df[df.columns[1:4]]

25. How can we retrieve a row in a pandas DataFrame?

# retrieving a single row by label with the loc indexer
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]

# retrieving several rows by passing a list of labels
rows = data.loc[["Avery Bradley", "R.J. Hunter"]]

26. How can you check if a DataFrame is empty in pandas?

Use the DataFrame.empty attribute to check whether a DataFrame is empty.
# empty is True if the DataFrame has no elements
result = df.empty
# Print the result
print(result)

27. How will you set the index of a DataFrame or Series?

# Create the index (one label per row; this assumes df has 5 rows)
index_ = ['Row_1', 'Row_2', 'Row_3', 'Row_4', 'Row_5']

# Set the index
df.index = index_

# Print the DataFrame
print(df)

28. How will you get the top 2 rows from a DataFrame in pandas?
# Select the first 2 rows of the DataFrame
dfObj1 = empDfObj.head(2)
print("First 2 rows of the DataFrame:")
print(dfObj1)

29. How to write a pandas DataFrame to a file?

When you have finished your data munging and manipulation with pandas, you might want to
export the DataFrame to another format. This section covers two ways of outputting your
DataFrame: to a CSV file or to an Excel file.
Outputting a DataFrame to CSV
To output a pandas DataFrame as a CSV file, you can use to_csv().
Writing a DataFrame to Excel
Similarly, you can use to_excel() to write your table to an Excel file.
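
A minimal sketch of to_csv(), writing to an in-memory buffer so that nothing is created on disk. The file names in the commented lines are hypothetical, and to_excel() additionally needs an Excel engine such as openpyxl to be installed.

```python
import io
import pandas as pd

df = pd.DataFrame({"Letter": ["p", "q"], "Number": [1, 2]})

# Write CSV text into an in-memory buffer; index=False leaves out the row labels
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Writing to real files works the same way (hypothetical paths):
# df.to_csv("out.csv", index=False)
# df.to_excel("out.xlsx", index=False)
```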

30. What is vectorization in Python pandas?

Ans: Vectorization is the process of running an operation on an entire array at once instead
of on one element at a time. This reduces the amount of explicit Python-level iteration.
pandas has a number of vectorized functions, such as aggregations and string functions, that
are optimized to operate on Series and DataFrames. Using the vectorized pandas functions is
therefore the preferred way to execute operations quickly.
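
As an illustration with made-up values, the same calculation is written below once as an explicit loop and once as a vectorized expression, followed by a vectorized string function.

```python
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0])

# Element-by-element: a Python-level loop over the values
looped = pd.Series([p * 1.1 for p in prices])

# Vectorized: the multiplication is applied to the whole Series at once
vectorized = prices * 1.1

# Vectorized string function through the .str accessor
names = pd.Series(["tom", "jack"])
upper = names.str.upper()
```

Both forms give the same numbers; the vectorized form avoids the Python loop and is usually much faster on large data.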

31. List some statistical functions in Python Pandas?


Ans: Some of the statistical functions in pandas are:
sum() – returns the sum of the values.
mean() – returns the mean, that is, the average of the values.
std() – returns the standard deviation of the numerical columns.
min() – returns the minimum value.
max() – returns the maximum value.
abs() – returns the absolute value of each element.
prod() – returns the product of the values.
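
A quick sketch of these functions on a small Series with made-up values. Note that abs() works element-wise and returns a Series, while the others reduce the Series to a single number.

```python
import pandas as pd

s = pd.Series([-2, 4, 6])

print(s.sum(), s.mean(), s.min(), s.max(), s.prod())
print(s.std())            # sample standard deviation (ddof=1 by default)
print(s.abs().tolist())   # element-wise absolute values
```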

32. What are the different ways a DataFrame can be created in pandas?
Ans:
A DataFrame can be created in several ways. Here are some of them:
• Using a list:
import pandas as pd
# initialize list of lists
data = [['p', 1], ['q', 2], ['r', 3]]
# create the pandas DataFrame
df = pd.DataFrame(data, columns=['Letter', 'Number'])
# print the DataFrame
print(df)
• Using a dict of ndarrays/lists:
To create a DataFrame from a dict of ndarrays/lists, all the arrays must be of the same
length. If an index is passed, its length must equal the length of the arrays. If no index
is passed, the index defaults to range(n), where n is the array length.
• Using arrays (here a dict of equal-length lists, with an explicit index):
import pandas as pd
# initialise data as a dict of lists
data = {'Name': ['Tom', 'Jack', 'nick', 'juli'], 'marks': [99, 98, 95, 90]}
# create the pandas DataFrame with custom row labels
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
# print the data
print(df)
#using Series of Dictionary
33. # Adding Columns in dataframe

34. #iterrows and applying conditions

35. How to convert a DataFrame to an array in pandas?

Ans: The to_numpy() method converts a DataFrame to a NumPy array.
# syntax
DataFrame.to_numpy(dtype=None, copy=False)
The dtype parameter specifies the data type of the resulting array, and copy=True ensures
that the returned value is not a view on another array.
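
A short sketch with made-up data. When the columns have different numeric types, the resulting array uses a common type that can hold all of them.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

arr = df.to_numpy()                                  # int and float columns become float64
as_copy = df.to_numpy(dtype="float64", copy=True)    # explicit dtype, guaranteed copy

print(arr.shape, arr.dtype)
```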

36. Difference between loc and iloc.

# importing the module
import pandas as pd

# creating a sample dataframe
data = pd.DataFrame({'Brand': ['Maruti', 'Hyundai', 'Tata',
                               'Mahindra', 'Maruti', 'Hyundai',
                               'Renault', 'Tata', 'Maruti'],
                     'Year': [2012, 2014, 2011, 2015, 2012,
                              2016, 2014, 2018, 2019],
                     'Kms Driven': [50000, 30000, 60000,
                                    25000, 10000, 46000,
                                    31000, 15000, 12000],
                     'City': ['Gurgaon', 'Delhi', 'Mumbai',
                              'Delhi', 'Mumbai', 'Delhi',
                              'Mumbai', 'Chennai', 'Ghaziabad'],
                     'Mileage': [28, 27, 25, 26, 28,
                                 29, 24, 21, 24]})

# displaying the DataFrame (display() is available in Jupyter/IPython; use print() elsewhere)
display(data)

loc : loc is a label-based indexer, which means that we pass the labels (names) of the rows
or columns that we want to select. It is used with square brackets, as in data.loc[...].
Unlike iloc, a slice passed to loc includes its last element. loc also accepts a boolean
Series as a row selector, which iloc does not.

1. Selecting data according to some conditions :

# selecting cars with brand 'Maruti' and Mileage > 25

display(data.loc[(data.Brand == 'Maruti') & (data.Mileage > 25)])

2. Selecting a range of rows from the DataFrame :

# selecting range of rows from 2 to 5

display(data.loc[2 : 5])

3. Updating the value of any column :

# updating values of Mileage if Year < 2015

data.loc[(data.Year < 2015), ['Mileage']] = 22

display(data)

iloc : iloc is an integer-position-based indexer, which means that we pass integer positions
to select specific rows/columns. Unlike loc, a slice passed to iloc does not include its
last element. iloc does not accept a boolean Series (a plain boolean list or NumPy array
does work).

1. Selecting rows using integer indices:

# selecting 0th, 2nd, 4th, and 7th index rows

display(data.iloc[[0, 2, 4, 7]])

2. Selecting a range of columns and rows simultaneously:

# selecting rows from 1 to 4 and columns from 2 to 4

display(data.iloc[1 : 5, 2 : 5])
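
The two differences described above (slice end points and boolean selectors) can be seen side by side in this small sketch, which uses a cut-down, illustrative version of the sample data.

```python
import pandas as pd

data = pd.DataFrame({"Year": [2012, 2014, 2011, 2015, 2012, 2016]})

by_label = data.loc[2:5]       # labels 2, 3, 4, 5: the end label is included
by_position = data.iloc[2:5]   # positions 2, 3, 4: the end position is excluded

mask = data["Year"] < 2013
with_loc = data.loc[mask]                 # a boolean Series works with loc
with_iloc = data.iloc[mask.to_numpy()]    # iloc needs a plain boolean array
```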

37. Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and the corresponding axis, or by directly
specifying index or column names. When using a multi-index, labels on different levels can
be removed by specifying the level.

Examples

>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
Drop columns
>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11
Drop a row by index

>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

# simultaneously drop both rows and columns (df here is a different DataFrame, with a
# two-level row index and the columns 'big' and 'small')
>>> df.drop(index='cow', columns='small')
                big
lama    speed   45.0
        weight  200.0
        length  1.5
falcon  speed   320.0
        weight  1.0
        length  0.3
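
The last example above uses a DataFrame that is not defined in these notes. The sketch below rebuilds a comparable one so that the call can be run; the 'big' values for lama and falcon follow the output shown, while the 'cow' rows and the whole 'small' column are made-up placeholder values.

```python
import pandas as pd

# Two-level row index: animal on the outer level, measurement on the inner level
midx = pd.MultiIndex.from_product(
    [["lama", "cow", "falcon"], ["speed", "weight", "length"]]
)
df = pd.DataFrame(
    {
        "big": [45.0, 200.0, 1.5, 30.0, 250.0, 1.5, 320.0, 1.0, 0.3],
        "small": [30.0, 100.0, 1.0, 20.0, 150.0, 0.8, 250.0, 0.8, 0.2],  # placeholders
    },
    index=midx,
)

# Drop every row under 'cow' and the column 'small' in a single call
result = df.drop(index="cow", columns="small")
print(result)
```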

DataFrame.items()
Iterate over (column name, Series) pairs. (In older pandas versions this method was also
available as iteritems(), which was removed in pandas 2.0.)

Iterates over the DataFrame columns, returning a tuple with the column name and the
content as a Series.

See also:
DataFrame.iterrows
    Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.itertuples
    Iterate over DataFrame rows as namedtuples of the values.

>>> df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'],
...                    'population': [1864, 22000, 80000]},
...                   index=['panda', 'polar', 'koala'])
>>> df
         species  population
panda       bear        1864
polar       bear       22000
koala  marsupial       80000
>>> for label, content in df.items():
...     print(f'label: {label}')
...     print('content:', content, sep='\n')
...
label: species
content:
panda         bear
polar         bear
koala    marsupial
Name: species, dtype: object
label: population
content:
panda     1864
polar    22000
koala    80000
Name: population, dtype: int64

DataFrame.itertuples(index=True, name='Pandas')
Iterate over DataFrame rows as namedtuples.

Parameters
index : bool, default True
    If True, return the index as the first element of the tuple.
name : str or None, default "Pandas"
    The name of the returned namedtuples, or None to return regular tuples.

>>> df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},
...                   index=['dog', 'hawk'])
>>> df
      num_legs  num_wings
dog          4          0
hawk         2          2
>>> for row in df.itertuples():
...     print(row)
...
Pandas(Index='dog', num_legs=4, num_wings=0)
Pandas(Index='hawk', num_legs=2, num_wings=2)
With the name parameter set, we get a custom name for the yielded namedtuples:

>>> for row in df.itertuples(name='Animal'):
...     print(row)
...
Animal(Index='dog', num_legs=4, num_wings=0)
Animal(Index='hawk', num_legs=2, num_wings=2)

pandas.DataFrame.pop
DataFrame.pop(item)

Return item and drop it from the frame. Raise KeyError if not found.

>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
>>> df.pop('class')
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object
>>> df
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN

DataFrame.isna()

Detect missing values.

DataFrame.isnull

Alias of isna.

DataFrame.notna

Boolean inverse of isna.

DataFrame.dropna

Omit axes labels with missing values.


>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
dtype: bool
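
notna() and dropna() are listed above without an example. A minimal sketch on a cut-down, illustrative version of the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [5, 6, np.nan], "toy": [None, "Batmobile", "Joker"]})

present = df.notna()                 # boolean inverse of isna()
complete_rows = df.dropna()          # keep only rows without any missing value
complete_cols = df.dropna(axis=1)    # keep only columns without any missing value

print(complete_rows)
```

Here only the middle row has no missing value, and both columns contain one, so dropping along axis=1 leaves no columns.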

pandas.DataFrame.size
Return an int representing the number of elements in this object.

Return the number of rows if Series. Otherwise return the number of rows times
number of columns if DataFrame.

>>> s = pd.Series({'a': 1, 'b': 2, 'c': 3})
>>> s.size
3
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.size
4
DROPPING OF ROWS/COLUMNS: 1. USING AXIS  2. WITHOUT AXIS

USING AXIS

1. DF1.drop(labels=[rowlabel], axis=0)  # temporary deletion (returns a new DataFrame)
2. DF1.drop(labels=[rowlabel1, rowlabel2], axis=0)  # temporary deletion
   OR DF1.drop(rowlabel, axis=0) OR DF1.drop(rowlabel)
3. DF1.drop(labels=[rowlabel], axis=0, inplace=True)  # permanent deletion
   DF1 = DF1.drop(labels=[rowlabel], axis=0)  # permanent deletion

WITHOUT AXIS

DF1.drop(index=[rowlabel])  # temporary deletion
DF1.drop(index=[rowlabel], inplace=True)  # permanent deletion
DF1 = DF1.drop(index=[rowlabel])  # permanent deletion

# Accessing rows and columns with conditions
