18 Useful Pandas Functions for Data Science
Introduction
Pandas is a widely used library in the data science community. It is a powerful tool that offers extensive capabilities for data manipulation, cleaning, and analysis. In this article, we will cover 18 of the most common DataFrame functions that every Data Analyst and Data Scientist should know.
Practical Code Examples
1. Create Dummy Dataframe
When I am experimenting with new concepts or unfamiliar functions, I usually create a dummy dataframe. This allows me to quickly try out new packages or ideas from blog posts.
import pandas as pd
df = pd.DataFrame({
"A": ["a", "b", "a"],
"B": ["b", "a", "c"],
"C": [1, 2, 3],
})
print(df.head())
A B C
0 a b 1
1 b a 2
2 a c 3
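If you need a larger throwaway frame with numeric data, a random dataframe works too. A minimal sketch using numpy (the shape and column names here are arbitrary):
import numpy as np
import pandas as pd
# 5 rows x 3 columns of random floats, handy for quick experiments
df_random = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])
print(df_random.head())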
2. Change Data Type/Column Name with Dictionary
It is easier and more readable to do DataFrame operations such as type conversion and column renaming with a dictionary.
import pandas as pd
df_a = pd.DataFrame({
"A": ["a", "b", "a"],
"B": ["1", "2", "3"],
"C": [1, 2, 3],
"D": ["2022-01-02", "2022-01-01", "2022-01-03"],
})
dtype0 = {
'A': 'str',
'B': 'int64',
'C': 'int64',
'D': 'datetime64[ns]',
}
df_b = df_a.astype(dtype0)
df_b.rename(
columns={"D": "start_date"},
inplace=True,
)
print(df_a.dtypes)
print(df_b.dtypes)
# Before (df_a)
A object
B object
C int64
D object
# After (df_b)
A object
B int64
C int64
start_date datetime64[ns]
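If you prefer a single chained expression, the conversion and rename can also be combined with method chaining. This is just an alternative sketch reusing df_a from above, with pd.to_datetime handling the date column:
df_c = (
    df_a
    .astype({"B": "int64", "C": "int64"})
    .assign(D=lambda d: pd.to_datetime(d["D"]))  # parse the date strings
    .rename(columns={"D": "start_date"})
)
print(df_c.dtypes)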
3. Print Sample Data
Sometimes when we are doing some quick exploratory analysis, we tend to use dummy variable names like df_aaa, df_bbb. It is nice to have a utility function that prints the first few rows of the dataframe along with the variable name.
import traceback
import re
import pandas as pd
def print_sample_data(df, n=5):
"""
Print top n records (max: 100) of the dataframe
Args:
df (Dataframe): Dataframe
n (int): Top n records to be displayed. Max = 100
Returns:
None
Example:
df_abc = pd.DataFrame.from_dict({'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']})
print_sample_data(df=df_abc, n=5)
"""
stack = traceback.extract_stack()
filename, lineno, function_name, code = stack[-2]
vars_name = re.compile(r"\((.*?)\).*$").search(code).groups()[0]
if n >= 100:
n = 100
print("\n---------------")
print(f"{vars_name}, shape={df.shape}\n")
print(df.head(n))
df_aaa = pd.DataFrame.from_dict({'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']})
df_bbb = df_aaa.copy()
print_sample_data(df_aaa, n=5)
print_sample_data(df_bbb, n=3)
---------------
df_aaa, n=5, shape=(4, 2)
col_1 col_2
0 3 a
1 2 b
2 1 c
3 0 d
---------------
df_bbb, n=3, shape=(4, 2)
col_1 col_2
0 3 a
1 2 b
2 1 c
4. Round Numeric Columns
A wrapper function to round different columns to different numbers of digits. A possible use case would be rounding percentage columns to 1 d.p. and sales figures to the nearest hundred.
import pandas as pd
def round_df(df, cols=None, digit=1):
"""
Round the given numeric columns of the dataframe to the specified number of decimal places;
round all numeric float columns if none are provided.
Args:
df (DataFrame): Input DataFrame
cols (List[str]): List of columns to be rounded. Default None for all numeric cols
digit (int): Number of digits to round to, e.g. 1 for 1 d.p., -2 for the nearest hundred
Returns:
df (DataFrame): DataFrame with rounded columns
"""
numeric_cols = []
if cols is not None:
# Existing columns only
numeric_cols = list(set(cols).intersection(set(df.columns)))
else:
# All numeric columns
numeric_cols = df.select_dtypes("number").columns
for col in numeric_cols:
if df[col].dtype.kind == 'f':
df[col] = round(df[col], digit)
return df
df_before = pd.DataFrame.from_dict({
'col_1': [3.4455, 2.4711, 1.03394, 0],
'col_2': [3450.4455, 2490.4111, 1000.03394, 0],
'col_3': ['a', 'b', 'c', 'd'],
})
df_after = df_before.copy()
df_after = round_df(df=df_after, cols=["col_1"], digit=1)
df_after = round_df(df=df_after, cols=["col_2"], digit=-2)
print(df_before)
print(df_after)
# df_before
col_1 col_2 col_3
0 3.44550 3450.44550 a
1 2.47110 2490.41110 b
2 1.03394 1000.03394 c
3 0.00000 0.00000 d
# df_after
col_1 col_2 col_3
0 3.4 3500.0 a
1 2.5 2500.0 b
2 1.0 1000.0 c
3 0.0 0.0 d
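Note that for the simple per-column case, the built-in DataFrame.round also accepts a dictionary mapping column names to decimal places; the wrapper above mainly adds the column filtering and the float-only check. A small sketch reusing df_before from this example:
df_rounded = df_before.round({"col_1": 1, "col_2": 2})
print(df_rounded)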
5. Group by Aggregate
It is possible to group by and apply several aggregate functions at once using the named-aggregation (tuple) syntax df.groupby(some_col).agg(new_col_name=(col, agg_func)).
Ref: pandas.core.groupby.DataFrameGroupBy.agg.html
import numpy as np
import pandas as pd
df_before = pd.DataFrame({
"group": ["a", "a", "a", "b"],
"num_1": [1.21, 1.22, 1.23, 1.24],
"num_2": [2, 3, 4, 5],
})
df_after = df_before.groupby('group').agg(
sum_num_1=('num_1', 'sum'),
mean_num_1=('num_1', 'mean'),
max_num_2=('num_2', 'max'),
lambda_num_2=('num_2', lambda x: np.sum(x + 1)),
).reset_index()
print(df_before)
print(df_after)
# df_before
group num_1 num_2
0 a 1.21 2
1 a 1.22 3
2 a 1.23 4
3 b 1.24 5
# df_after
group sum_num_1 mean_num_1 max_num_2 lambda_num_2
0 a 3.66 1.22 4 12
1 b 1.24 1.24 5 6
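If you want to skip the trailing reset_index(), groupby also accepts as_index=False so the group keys stay as regular columns. An equivalent sketch for two of the aggregates above:
df_alt = df_before.groupby("group", as_index=False).agg(
    sum_num_1=("num_1", "sum"),
    mean_num_1=("num_1", "mean"),
)
print(df_alt)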
6. Group by Transform
We can use the group by transform function, with the syntax df["agg_col"] = df.groupby("group_col")["col"].transform(agg_func), to add the aggregated column back to the original dataframe.
Use case: sum up the total sales per city, and sort the original dataset by the sales ranking in descending order.
Ref: pandas.core.groupby.SeriesGroupBy.transform.html
import pandas as pd
import numpy as np
import itertools
df_before = pd.DataFrame({
'city': list(itertools.chain(*[
["chicago"] * 3,
["new_york"] * 4,
["san_francisco"] * 5,
])),
'sales': list(itertools.chain(*[
np.repeat(2, 3),
np.repeat(3, 4),
np.repeat(1, 5),
]))
})
df_after = df_before.copy()
df_after['total_rows'] = df_after.groupby('city')["sales"].transform(lambda x: len(x))
df_after['total_sales'] = df_after.groupby('city')["sales"].transform("sum")
df_after['sales_rank'] = df_after["total_sales"].rank(
method='dense',
ascending=False,
).astype(int)
df_after.sort_values(["sales_rank"], ascending=True, inplace=True)
print(df_before)
print(df_after)
# df_before
city sales
0 chicago 2
1 chicago 2
2 chicago 2
3 new_york 3
4 new_york 3
5 new_york 3
6 new_york 3
7 san_francisco 1
8 san_francisco 1
9 san_francisco 1
10 san_francisco 1
11 san_francisco 1
# df_after
city sales total_rows total_sales sales_rank
3 new_york 3 4 12 1
4 new_york 3 4 12 1
5 new_york 3 4 12 1
6 new_york 3 4 12 1
0 chicago 2 3 6 2
1 chicago 2 3 6 2
2 chicago 2 3 6 2
7 san_francisco 1 5 5 3
8 san_francisco 1 5 5 3
9 san_francisco 1 5 5 3
10 san_francisco 1 5 5 3
11 san_francisco 1 5 5 3
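The lambda used for the row count can also be replaced with the built-in "size" aggregation string, which is more readable and usually faster:
# Same result as transform(lambda x: len(x))
df_after["total_rows"] = df_after.groupby("city")["sales"].transform("size")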
7. Group by Lambda
Sometimes when you want to do a more complex aggregation (e.g. custom code, multi-step business logic, etc.), you can use group by with a lambda function (syntax: df.groupby("group_col").apply(lambda x: custom_grouped_func(x))) to do that.
Use case: fit a simple regression model to each group, and use the model to predict the sales accordingly.
import pandas as pd
import numpy as np
import itertools
from sklearn import linear_model
np.random.seed(42)
def generate_seq(slope, intercept, size, mu=0, sigma=1):
"""
Generate (mx + b) sequence with size n plus a random gaussian noise.
"""
seq = list(range(1, size + 1, 1))
noise = np.random.normal(mu, sigma, size)
y = [intercept + slope * x for x in seq]
y = [sum(x) for x in zip(y, noise)]
return y
# Create a dummy dataframe for demonstration
df_before = pd.DataFrame({
'city':
list(itertools.chain(*[
["chicago"] * 20,
["new_york"] * 30,
["san_francisco"] * 40,
])),
'period':
list(itertools.chain(*[
list(range(1, 20 + 1, 1)),
list(range(1, 30 + 1, 1)),
list(range(1, 40 + 1, 1)),
])),
'sales':
list(
itertools.chain(*[
generate_seq(slope=1, intercept=5, size=20),
generate_seq(slope=2, intercept=8, size=30),
generate_seq(slope=0.5, intercept=2, size=40),
]))
})
def simple_group_func(gdf: pd.DataFrame):
"""
A grouped function to do a simple linear regression on the given data, return the formula, and the predicted sales.
"""
x = gdf["period"].values
y = gdf["sales"].values
length = gdf.shape[0]
X = x.reshape(length, 1)
y = y.reshape(length, 1)
lr = linear_model.LinearRegression()
lr.fit(X, y)
slope = round(lr.coef_[0][0], 2)
intercept = round(lr.intercept_[0], 2)
# Add the result back to the original dataframe with the additional columns
gdf["formula"] = f"y={slope}*x + {intercept}"
gdf["actual"] = gdf["sales"]
gdf["predict"] = lr.predict(X) # usually we use predict for unseen data, but just keeping it simple in this case
return gdf
df_after = df_before.groupby("city").apply(lambda x: simple_group_func(x)).reset_index(drop=True)
print(df_before)
print(df_after)
# df_before
city period sales
0 chicago 1 6.496714
1 chicago 2 6.861736
2 chicago 3 8.647689
3 chicago 4 10.523030
4 chicago 5 9.765847
.. ... ... ...
85 san_francisco 36 19.498243
86 san_francisco 37 21.415402
87 san_francisco 38 21.328751
88 san_francisco 39 20.970240
89 san_francisco 40 22.513267
[90 rows x 3 columns]
# df_after. With 3 additional columns from the grouped function(["formula", "actual", "predict"])
city period sales formula actual predict
0 chicago 1 6.496714 y=0.9*x + 5.87 6.496714 6.774567
1 chicago 2 6.861736 y=0.9*x + 5.87 6.861736 7.675002
2 chicago 3 8.647689 y=0.9*x + 5.87 8.647689 8.575437
3 chicago 4 10.523030 y=0.9*x + 5.87 10.523030 9.475872
4 chicago 5 9.765847 y=0.9*x + 5.87 9.765847 10.376308
.. ... ... ... ... ... ...
85 san_francisco 36 19.498243 y=0.5*x + 2.14 19.498243 19.998570
86 san_francisco 37 21.415402 y=0.5*x + 2.14 21.415402 20.494638
87 san_francisco 38 21.328751 y=0.5*x + 2.14 21.328751 20.990705
88 san_francisco 39 20.970240 y=0.5*x + 2.14 20.970240 21.486773
89 san_francisco 40 22.513267 y=0.5*x + 2.14 22.513267 21.982840
[90 rows x 6 columns]
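If you only need one row of fitted parameters per group (rather than per-row predictions), the grouped function can return a Series instead. A sketch reusing df_before and sklearn from above; fit_formula is a hypothetical helper name:
def fit_formula(gdf: pd.DataFrame) -> pd.Series:
    # Fit a per-group linear regression and return only its parameters
    lr = linear_model.LinearRegression()
    lr.fit(gdf[["period"]], gdf["sales"])
    return pd.Series({"slope": lr.coef_[0], "intercept": lr.intercept_})
df_params = df_before.groupby("city").apply(fit_formula).reset_index()
print(df_params)  # one row per city with slope and intercept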
8. Long to Wide
Just like an Excel pivot table.
Use case: Typical use case would be some Time series data. When dealing with time series data, you may have multiple observations for each time point. Converting the data from long to wide format allows you to have a single row for each time point, making it easier to perform calculations or visualize the data.
a) For Single record per group - pd.pivot
Ref: pandas.DataFrame.pivot.html
import pandas as pd
df_long = pd.DataFrame({
'group': [1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 , 3 , 3 , 3 , 3 ],
'sub_group': ["a" , "a" , "b" , "b" , "a" , "a" , "b" , "b" , "a" , "a" , "b" , "b" ],
'period': ["T1", "T2", "T1", "T2", "T1", "T2", "T1", "T2", "T1", "T2", "T1", "T2"],
'value_1': [10, 15, 20, 25, 30, 35, 1, 5, 30, 21, 32, 35 ],
'value_2': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6 ],
})
# Long to wide format using pivot
df_wide = df_long.pivot(
index=["group", "sub_group"],
columns="period",
values=["value_1", "value_2"],
)
# Flatten multi-level index dataframe
df_wide_flattened = df_wide.copy()
df_wide_flattened.columns = ["_".join(x) for x in df_wide_flattened.columns.to_flat_index()]
df_wide_flattened.reset_index(inplace=True)
print(df_long)
print(df_wide) # multi-level index
print(df_wide_flattened) # flattened level
# df_long
group sub_group period value_1 value_2
0 1 a T1 10 1
1 1 a T2 15 2
2 1 b T1 20 3
3 1 b T2 25 4
4 2 a T1 30 5
5 2 a T2 35 6
6 2 b T1 1 1
7 2 b T2 5 2
8 3 a T1 30 3
9 3 a T2 21 4
10 3 b T1 32 5
11 3 b T2 35 6
# df_wide (pivot)
value_1 value_2
period T1 T2 T1 T2
group sub_group
1 a 10 15 1 2
b 20 25 3 4
2 a 30 35 5 6
b 1 5 1 2
3 a 30 21 3 4
b 32 35 5 6
# df_wide_flattened (pivot)
group sub_group value_1_T1 value_1_T2 value_2_T1 value_2_T2
0 1 a 10 15 1 2
1 1 b 20 25 3 4
2 2 a 30 35 5 6
3 2 b 1 5 1 2
4 3 a 30 21 3 4
5 3 b 32 35 5 6
b) For Multiple records per group - pd.pivot_table with aggfunc
For a dataframe with multiple records per group, we need to use pd.pivot_table instead of pd.pivot, and provide an aggregate function as well (e.g. sum, mean, etc.). Here I deliberately use a similar dataframe (with the sub_group column removed and multiple records allowed per group), so you can easily tell the difference between the two functions.
Ref: pandas.DataFrame.pivot_table.html
import pandas as pd
df_long = pd.DataFrame({
'group': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'period': ["T1", "T1", "T2", "T2", "T1", "T1", "T2", "T2", "T1", "T1", "T1", "T2"],
'value_1': [10, 15, 20, 25, 30, 35, 1, 5, 30, 21, 32, 35],
'value_2': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
})
# Long to wide format using pivot_table
df_wide = df_long.pivot_table(
index="group",
columns="period",
values=["value_1", "value_2"],
aggfunc="sum", # Need an aggregate function to handle multiple records per group/index
)
# Flatten multi-level index dataframe
df_wide_flattened = df_wide.copy()
df_wide_flattened.columns = ["_".join(x) for x in df_wide_flattened.columns.to_flat_index()]
df_wide_flattened.reset_index(inplace=True)
print(df_long)
print(df_wide)
print(df_wide_flattened)
# df_long
group period value_1 value_2
0 1 T1 10 1
1 1 T1 15 2
2 1 T2 20 3
3 1 T2 25 4
4 2 T1 30 5
5 2 T1 35 6
6 2 T2 1 1
7 2 T2 5 2
8 3 T1 30 3
9 3 T1 21 4
10 3 T1 32 5
11 3 T2 35 6
# df_wide (pivot_table)
value_1 value_2
period T1 T2 T1 T2
group
1 25 45 3 7
2 65 6 11 3
3 83 35 12 6
# df_wide_flattened (pivot_table)
group value_1_T1 value_1_T2 value_2_T1 value_2_T2
0 1 25 45 3 7
1 2 65 6 11 3
2 3 83 35 12 6
9. Wide to Long
Just a converse of the Long to Wide transformation.
Use case: Many visualization libraries, such as seaborn or matplotlib, expect data in a long format to create meaningful plots. By converting data from wide to long format, it becomes easier to plot and visualize relationships, trends, or comparisons across different variables.
Ref: pandas.melt.html
import pandas as pd
df_wide = pd.DataFrame({
'id': [1, 2, 3],
't1': [10, 20, 30],
't2': [15, 25, 35],
't3': [12, 22, 32],
})
# Wide to long format with melt
df_long = df_wide.melt(
id_vars='id',
var_name='time',
value_name='value',
)
print(df_wide)
print(df_long)
# df_wide
id t1 t2 t3
0 1 10 15 12
1 2 20 25 22
2 3 30 35 32
# df_long
id time value
0 1 t1 10
1 2 t1 20
2 3 t1 30
3 1 t2 15
4 2 t2 25
5 3 t2 35
6 1 t3 12
7 2 t3 22
8 3 t3 32
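When the wide column names share a common stub plus a numeric suffix (t1, t2, t3 here), pd.wide_to_long is another option that parses the suffix into its own column. A sketch on the same df_wide:
df_long_alt = pd.wide_to_long(df_wide, stubnames="t", i="id", j="time").reset_index()
print(df_long_alt)  # columns: id, time, t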
10. Merge two Dataframes
The pd.merge() function is used to combine two or more DataFrames based on a common column or index. It is commonly used when you want to bring together related data from multiple sources.
Use case: Consider an example where we have two DataFrames representing orders and customers, and we want to merge them based on a shared column, such as “customer_id”
Ref: pandas.DataFrame.merge.html
import pandas as pd
df_order = pd.DataFrame({
'order_id': ["0001", "0002", "0003", "0004", "0005", "0006", "0007"],
'customer_id': [101, 101, 101, 103, 104, 105, None],
})
df_customer = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105],
'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Evan'],
})
df_merge = pd.merge(
df_order,
df_customer,
on='customer_id',
how='left', # Left join, for non-login user
)
print(df_order)
print(df_customer)
print(df_merge)
# df_order
order_id customer_id
0 0001 101
1 0002 101
2 0003 101
3 0004 103
4 0005 104
5 0006 105
6 0007 NaN
# df_customer
customer_id name
0 101 Alice
1 102 Bob
2 103 Charlie
3 104 Dave
4 105 Evan
# df_merge
order_id customer_id name
0 0001 101 Alice
1 0002 101 Alice
2 0003 101 Alice
3 0004 103 Charlie
4 0005 104 Dave
5 0006 105 Evan
6 0007 NaN NaN # Left join, for non-login user
Advanced (with different keys and added suffixes)
import pandas as pd
# Sample dataframes
df_order = pd.DataFrame({
'order_id': ["0001", "0002", "0003", "0004", "0005", "0006", "0007"],
'name': ["jan order", "feb order", "march order", "april order", "may order", "june order", "july order"],
'c_id': [101, 101, 101, 103, 104, 105, None],
})
df_customer = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105],
'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Evan'],
})
df_merge = pd.merge(
df_order,
df_customer,
left_on='c_id',
right_on='customer_id',
suffixes=('_order', '_customer'),  # As both dataframes have the same column 'name', a suffix is added to each accordingly
how='left',
)
print(df_order)
print(df_customer)
print(df_merge)
# df_order
order_id name c_id
0 0001 jan order 101
1 0002 feb order 101
2 0003 march order 101
3 0004 april order 103
4 0005 may order 104
5 0006 june order 105
6 0007 july order NaN
# df_customer
customer_id name
0 101 Alice
1 102 Bob
2 103 Charlie
3 104 Dave
4 105 Evan
# df_merge. Suffixes are added to the 'name' columns accordingly
order_id name_order c_id customer_id name_customer
0 0001 jan order 101 101 Alice
1 0002 feb order 101 101 Alice
2 0003 march order 101 101 Alice
3 0004 april order 103 103 Charlie
4 0005 may order 104 104 Dave
5 0006 june order 105 105 Evan
6 0007 july order NaN NaN NaN
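A handy companion when debugging joins is indicator=True, which adds a _merge column showing whether each row matched (both / left_only / right_only). A sketch reusing the frames above:
df_audit = pd.merge(
    df_order,
    df_customer,
    left_on='c_id',
    right_on='customer_id',
    how='left',
    indicator=True,  # adds the "_merge" column
)
print(df_audit["_merge"].value_counts())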
11. cut and qcut
Create a new categorical column by cutting numerical values into different bins.
Use case: educational grading, with an absolute grading system (binning with cut()) or a relative grading system (quantile cut with qcut()).
Ref: pandas.cut.html and pandas.qcut.html
import pandas as pd
df_before = pd.DataFrame({
'student': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve', 'Frank', 'Grace', 'Helen'],
'score': [85, 92, 78, 64, 90, 82, 75, 88],
})
# Cut the scores into bins using cut function, with manual binning
df_after = df_before.copy()
df_after['grade_bin'] = pd.cut(
df_after['score'],
bins=[0, 60, 70, 80, 90, 100],
labels=['F', 'D', 'C', 'B', 'A'],
)
# Quantile based cut with qcut
df_after['grade_quantile'] = pd.qcut(
df_after['score'],
q=5,
labels=['F', 'D', 'C', 'B', 'A'],
)
print(df_before)
print(df_after)
# df_before
student score
0 Alice 85
1 Bob 92
2 Charlie 78
3 Dave 64
4 Eve 90
5 Frank 82
6 Grace 75
7 Helen 88
# df_after
(cut) (qcut)
student score grade_bin grade_quantile
0 Alice 85 B C
1 Bob 92 A A
2 Charlie 78 C D
3 Dave 64 D F
4 Eve 90 B A
5 Frank 82 B C
6 Grace 75 C F
7 Helen 88 B B
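If you want to inspect the quantile edges that qcut picked, retbins=True also returns them, which is useful for documenting a relative grading scheme. A sketch on the same scores:
grades, bin_edges = pd.qcut(
    df_before['score'],
    q=5,
    labels=['F', 'D', 'C', 'B', 'A'],
    retbins=True,  # also return the computed quantile edges
)
print(bin_edges)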
12. Cumulative Sum, Cumulative Percentage
Cumulative sum and percentage per group.
Use case: Market share analysis: Percentage is useful for analyzing market share. By calculating the percentage of a particular category or product’s sales relative to the total sales in the market, you can assess its market dominance and track changes in market dynamics easily.
import numpy as np
import pandas as pd
import itertools
df_before = pd.DataFrame({
'city': list(itertools.chain(*[
["chicago"] * 3,
["new_york"] * 4,
["san_francisco"] * 5,
])),
'sales': list(itertools.chain(*[
np.repeat(2, 3),
np.repeat(3, 4),
np.repeat(1, 5),
]))
})
# Add cumulative sum
df_after = df_before.copy()
df_after['csum'] = df_before.groupby(['city'])['sales'].cumsum()
# Add cumulative percentage
gb_obj = df_before.groupby('city')['sales']
df_after['cpct'] = round(gb_obj.cumsum() / gb_obj.transform('sum') * 100, 1)
print(df_before)
print(df_after)
# df_before
city sales
0 chicago 2
1 chicago 2
2 chicago 2
3 new_york 3
4 new_york 3
5 new_york 3
6 new_york 3
7 san_francisco 1
8 san_francisco 1
9 san_francisco 1
10 san_francisco 1
11 san_francisco 1
# df_after
city sales csum cpct
0 chicago 2 2 33.3
1 chicago 2 4 66.7
2 chicago 2 6 100.0
3 new_york 3 3 25.0
4 new_york 3 6 50.0
5 new_york 3 9 75.0
6 new_york 3 12 100.0
7 san_francisco 1 1 20.0
8 san_francisco 1 2 40.0
9 san_francisco 1 3 60.0
10 san_francisco 1 4 80.0
11 san_francisco 1 5 100.0
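For the market-share use case, the plain (non-cumulative) share per city follows the same pattern, dividing the per-group total by the grand total. A sketch reusing df_before and df_after from above:
df_after['city_share_pct'] = round(
    df_before.groupby('city')['sales'].transform('sum') / df_before['sales'].sum() * 100, 1
)
print(df_after[['city', 'city_share_pct']].drop_duplicates())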
13. nunique and describe
Just some exploratory data analysis (EDA) stuff: print out the number of distinct values per column and the five-number summary ([0, 25, 50, 75, 100] percentiles) of the numeric columns.
import pandas as pd
# since pandas 0.19.2
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)
print(df.nunique())
print(df.describe())
# df.nunique()
sepal_length 35
sepal_width 23
petal_length 43
petal_width 22
species 3
dtype: int64
# df.describe()
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
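By default describe() only summarizes numeric columns; passing include="object" (or include="all") also covers text columns such as species:
print(df.describe(include="object"))  # count, unique, top, freq for the species column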
14. Crosstab
The crosstab function (syntax: pd.crosstab(df["col_a"], df["col_b"])) is typically used when we are analyzing the relationship between two or more categorical variables. It creates a cross-tabulation or contingency table that summarizes the frequency or occurrence of different categories across multiple variables.
import pandas as pd
df_before = pd.DataFrame({
'gender': ['male', 'male', 'female', 'female', 'male', 'female', 'male', 'female'],
'education': ['high school', 'college', 'college', 'graduate', 'high school', 'graduate', 'college', 'graduate'],
'score': [1, 5, 6, 8, 3, 9, 10, 3]
})
df_after = pd.crosstab(df_before['gender'], df_before['education']).reset_index()
print(df_before)
print(df_after)
# df_before
gender education score
0 male high school 1
1 male college 5
2 female college 6
3 female graduate 8
4 male high school 3
5 female graduate 9
6 male college 10
7 female graduate 3
# df_after
education gender college graduate high school
0 female 1 3 0
1 male 2 0 2
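crosstab can also normalize the counts into proportions (per row, per column, or overall), which often reads better than raw frequencies. A sketch on the same data:
df_pct = pd.crosstab(
    df_before['gender'],
    df_before['education'],
    normalize='index',  # each row sums to 1.0
).round(2).reset_index()
print(df_pct)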
15. Remap Column Value with map
Map values in the columns according to an input dictionary.
import pandas as pd
df_before = pd.DataFrame({
'id': [1, 2, 3, 4, 5, 26 ],
'fruit': ["a", "b", "c", "d", "e", "z"],
})
m_fruit = {
"a": "apple",
"b": "banana",
"c": "coconut",
"d": "durian",
"e": "elderberry",
}
# Replace column value with the mapping dictionary, and use fillna as default value
df_after = df_before.copy()
df_after["fruit"] = df_after["fruit"].map(m_fruit).fillna("zucchini")
print(df_before)
print(df_after)
# df_before
id fruit
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 26 z
# df_after
id fruit
0 1 apple
1 2 banana
2 3 coconut
3 4 durian
4 5 elderberry
5 26 zucchini
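If you want unmapped values to pass through unchanged instead of falling back to a single default, Series.replace behaves that way (keys not found in the dictionary are kept as-is). A sketch on the same data:
df_alt = df_before.copy()
df_alt["fruit"] = df_alt["fruit"].replace(m_fruit)  # "z" stays as "z"
print(df_alt)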
16. Percentage of non-zero and na Values
Check the percentage of missing values in a DataFrame.
Ref: Percentage of missing values
import pandas as pd
import numpy as np
df_before = pd.DataFrame({
'col_1': [1, 2, 3, 0, 5, np.nan],
'col_2': ["a", "b", "c", None, "e", None],
'col_3': ["x", "x", "x", "x", "x", "x"],
})
# Method 1
df_after = ((df_before.isnull() | df_before.isna()).sum() / df_before.index.size) \
.round(2) \
.to_frame()
print(df_before)
print(df_after)
# Method 2
print(df_before.isnull().mean().round(2))
# df_before
col_1 col_2 col_3
0 1.0 a x
1 2.0 b x
2 3.0 c x
3 0.0 None x
4 5.0 e x
5 NaN None x
# df_after
0
col_1 0.17
col_2 0.33
col_3 0.00
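The same one-liner pattern works for zero values in the numeric columns, which is often checked alongside missingness (and is what the "non-zero" part of this section's title refers to). A sketch on df_before:
# Percentage of zero values per numeric column
print((df_before.select_dtypes("number") == 0).mean().round(2))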
17. Find a Specific Table from a Webpage
This one came to me during research: pandas offers a simple way to read HTML tables directly with the pd.read_html function.
# Ref: https://www.geeksforgeeks.org/scraping-wikipedia-table-with-pandas-using-read_html
import pandas as pd
# Remarks: need lxml package
df = pd.read_html('https://en.wikipedia.org/wiki/Demographics_of_India',
match='Population distribution by states/union territories',
)
print(df[0].head())
Rank State/UT Population[59] ... Urban[60] Area[61] (km2) Density (per km2)
0 1 Uttar Pradesh 199812341 ... 44470455 240928 828
1 2 Maharashtra 112374333 ... 50827531 307713 365
2 3 Bihar 104099452 ... 11729609 94163 1102
3 4 West Bengal 91276115 ... 29134060 88752 1030
4 5 Madhya Pradesh 72626809 ... 20059666 308245 236
18. Sampling
Sampling is a common technique used in data analysis and statistics to make inferences about a larger population based on a subset of the data. It is also useful when you want to draw a representative sample of the data to do some quick analysis and make inferences about the larger population.
Ref: pandas.DataFrame.sample.html
import pandas as pd
import numpy as np
np.random.seed(123)
# Generate dummy dataframe with Gaussian distribution
mu, sigma, size = 0, 1, 100
dummy_data = np.random.normal(loc=mu, scale=sigma, size=size)
df_before = pd.DataFrame({'value': dummy_data})
# Sample without replacement
df_after = df_before.sample(
n=10,
replace=False,
).reset_index(drop=True)
print(df_before)
print(df_after)
# df_before
value
0 -1.085631
1 0.997345
2 0.282978
3 -1.506295
4 -0.578600
.. ...
95 1.031114
96 -1.084568
97 -1.363472
98 0.379401
99 -0.379176
# df_after (Sampling without replacement)
value
0 -0.443982
1 0.737369
2 -1.085631
3 0.283627
4 -1.605963
5 -1.253881
6 -0.638902
7 1.491390
8 1.177862
9 1.031114
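Two common variations worth knowing: sampling a fraction of the rows instead of a fixed count, and fixing random_state for reproducibility. A sketch reusing df_before:
# Sample 10% of the rows, reproducibly
df_frac = df_before.sample(frac=0.1, random_state=42).reset_index(drop=True)
print(df_frac)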
Final Thoughts
The post turned out longer than I thought. I hope the examples make your life as a data scientist easier and that you picked up some new skills along the way.
Happy Coding!