18 Useful Pandas Functions for Data Science
Introduction
Pandas is a widely used library in the data science community. It is a powerful tool that offers extensive capabilities for data manipulation, cleaning, and analysis. In this article, we will cover 18 of the most common DataFrame functions that every Data Analyst and Data Scientist should know.
Practical Code Examples
1. Create Dummy Dataframe
When I am experimenting with new concepts or unfamiliar functions, I usually create a dummy dataframe. This allows me to quickly try out new packages or ideas from blog posts.
import pandas as pd
df = pd.DataFrame({
"A": ["a", "b", "a"],
"B": ["b", "a", "c"],
"C": [1, 2, 3],
})
print(df.head())
A B C
0 a b 1
1 b a 2
2 a c 3
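If you need a larger throwaway frame with numeric data, a random dataframe works too. A minimal sketch using numpy (the shape and column names here are arbitrary):
import numpy as np
import pandas as pd
# 5 rows x 3 columns of random floats, handy for quick experiments
df_random = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])
print(df_random.head())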
2. Change Data Type/Column Name with Dictionary
It is easier and more readable to do DataFrame operations such as type conversion and column renaming with a dictionary.
import pandas as pd
df_a = pd.DataFrame({
"A": ["a", "b", "a"],
"B": ["1", "2", "3"],
"C": [1, 2, 3],
"D": ["2022-01-02", "2022-01-01", "2022-01-03"],
})
dtype0 = {
'A': 'str',
'B': 'int64',
'C': 'int64',
'D': 'datetime64[ns]',
}
df_b = df_a.astype(dtype0)
df_b.rename(
columns={"D": "start_date"},
inplace=True,
)
print(df_a.dtypes)
print(df_b.dtypes)
# Before (df_a)
A object
B object
C int64
D object
# After (df_b)
A object
B int64
C int64
start_date datetime64[ns]
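If you prefer a single chained expression, the conversion and rename can also be combined with method chaining. This is just an alternative sketch reusing df_a from above, with pd.to_datetime handling the date column:
df_c = (
    df_a
    .astype({"B": "int64", "C": "int64"})
    .assign(D=lambda d: pd.to_datetime(d["D"]))  # parse the date strings
    .rename(columns={"D": "start_date"})
)
print(df_c.dtypes)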
3. Print Sample Data
Sometimes when we are doing some quick exploratory analysis, we tend to use dummy variable names like df_aaa, df_bbb. It is nice to have a utility function that prints the first few rows of the dataframe along with the variable name.
import traceback
import re
import pandas as pd
def print_sample_data(df, n=5):
"""
Print top n records (max: 100) of the dataframe
Args:
df (Dataframe): Dataframe
n (int): Top n records to be displayed. Max = 100
Returns:
None
Example:
df_abc = pd.DataFrame.from_dict({'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']})
print_sample_data(df=df_abc, n=5)
"""
stack = traceback.extract_stack()
filename, lineno, function_name, code = stack[-2]
vars_name = re.compile(r"\((.*?)\).*$").search(code).groups()[0]
if n >= 100:
n = 100
print("\n---------------")
print(f"{vars_name}, shape={df.shape}\n")
print(df.head(n))
df_aaa = pd.DataFrame.from_dict({'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']})
df_bbb = df_aaa.copy()
print_sample_data(df_aaa, n=5)
print_sample_data(df_bbb, n=3)
---------------
df_aaa, n=5, shape=(4, 2)
col_1 col_2
0 3 a
1 2 b
2 1 c
3 0 d
---------------
df_bbb, n=3, shape=(4, 2)
col_1 col_2
0 3 a
1 2 b
2 1 c
4. Round Numeric Columns
A wrapper function to round different columns to different numbers of digits. A possible use case would be rounding percentage columns to 1 d.p. and sales figures to the nearest hundred.
import pandas as pd
def round_df(df, cols=None, digit=1):
"""
Round the given numeric columns of the dataframe to the specified number of decimal places;
round all numeric float columns if none are provided.
Args:
df (DataFrame): Input DataFrame
cols (List[str]): List of columns to be rounded. Default None for all numeric cols
digit (int): Number of digits to round to, e.g. 1 for 1 d.p., -2 for the nearest hundred
Returns:
df (DataFrame): DataFrame with rounded columns
"""
numeric_cols = []
if cols is not None:
# Existing columns only
numeric_cols = list(set(cols).intersection(set(df.columns)))
else:
# All numeric columns
numeric_cols = df.select_dtypes("number").columns
for col in numeric_cols:
if df[col].dtype.kind == 'f':
df[col] = round(df[col], digit)
return df
df_before = pd.DataFrame.from_dict({
'col_1': [3.4455, 2.4711, 1.03394, 0],
'col_2': [3450.4455, 2490.4111, 1000.03394, 0],
'col_3': ['a', 'b', 'c', 'd'],
})
df_after = df_before.copy()
df_after = round_df(df=df_after, cols=["col_1"], digit=1)
df_after = round_df(df=df_after, cols=["col_2"], digit=-2)
print(df_before)
print(df_after)
# df_before
col_1 col_2 col_3
0 3.44550 3450.44550 a
1 2.47110 2490.41110 b
2 1.03394 1000.03394 c
3 0.00000 0.00000 d
# df_after
col_1 col_2 col_3
0 3.4 3500.0 a
1 2.5 2500.0 b
2 1.0 1000.0 c
3 0.0 0.0 d
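Note that for the simple per-column case, the built-in DataFrame.round also accepts a dictionary mapping column names to decimal places; the wrapper above mainly adds the column filtering and the float-only check. A small sketch reusing df_before from this example:
df_rounded = df_before.round({"col_1": 1, "col_2": 2})
print(df_rounded)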
5. Group by Aggregate
It is possible to group by and apply several aggregate functions at once using the named-aggregation (tuple) syntax df.groupby(some_col).agg(new_col_name=(col, agg_func)).
Ref: pandas.core.groupby.DataFrameGroupBy.agg.html
import numpy as np
import pandas as pd
df_before = pd.DataFrame({
"group": ["a", "a", "a", "b"],
"num_1": [1.21, 1.22, 1.23, 1.24],
"num_2": [2, 3, 4, 5],
})
df_after = df_before.groupby('group').agg(
sum_num_1=('num_1', 'sum'),
mean_num_1=('num_1', 'mean'),
max_num_2=('num_2', 'max'),
lambda_num_2=('num_2', lambda x: np.sum(x + 1)),
).reset_index()
print(df_before)
print(df_after)
# df_before
group num_1 num_2
0 a 1.21 2
1 a 1.22 3
2 a 1.23 4
3 b 1.24 5
# df_after
group sum_num_1 mean_num_1 max_num_2 lambda_num_2
0 a 3.66 1.22 4 12
1 b 1.24 1.24 5 6
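If you want to skip the trailing reset_index(), groupby also accepts as_index=False so the group keys stay as regular columns. An equivalent sketch for two of the aggregates above:
df_alt = df_before.groupby("group", as_index=False).agg(
    sum_num_1=("num_1", "sum"),
    mean_num_1=("num_1", "mean"),
)
print(df_alt)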
6. Group by Transform
We can use the group by transform function, with the syntax df["agg_col"] = df.groupby("group_col")["col"].transform(agg_func), to add the aggregated column back to the original dataframe.
Use case: sum up the total sales per city, and sort the original dataset by the sales ranking in descending order.
Ref: pandas.core.groupby.SeriesGroupBy.transform.html
import pandas as pd
import numpy as np
import itertools
df_before = pd.DataFrame({
'city': list(itertools.chain(*[
["chicago"] * 3,
["new_york"] * 4,
["san_francisco"] * 5,
])),
'sales': list(itertools.chain(*[
np.repeat(2, 3),
np.repeat(3, 4),
np.repeat(1, 5),
]))
})
df_after = df_before.copy()
df_after['total_rows'] = df_after.groupby('city')["sales"].transform(lambda x: len(x))
df_after['total_sales'] = df_after.groupby('city')["sales"].transform("sum")
df_after['sales_rank'] = df_after["total_sales"].rank(
method='dense',
ascending=False,
).astype(int)
df_after.sort_values(["sales_rank"], ascending=True, inplace=True)
print(df_before)
print(df_after)
# df_before
city sales
0 chicago 2
1 chicago 2
2 chicago 2
3 new_york 3
4 new_york 3
5 new_york 3
6 new_york 3
7 san_francisco 1
8 san_francisco 1
9 san_francisco 1
10 san_francisco 1
11 san_francisco 1
# df_after
city sales total_rows total_sales sales_rank
3 new_york 3 4 12 1
4 new_york 3 4 12 1
5 new_york 3 4 12 1
6 new_york 3 4 12 1
0 chicago 2 3 6 2
1 chicago 2 3 6 2
2 chicago 2 3 6 2
7 san_francisco 1 5 5 3
8 san_francisco 1 5 5 3
9 san_francisco 1 5 5 3
10 san_francisco 1 5 5 3
11 san_francisco 1 5 5 3
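The lambda used for the row count can also be replaced with the built-in "size" aggregation string, which is more readable and usually faster:
# Same result as transform(lambda x: len(x))
df_after["total_rows"] = df_after.groupby("city")["sales"].transform("size")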
7. Group by Lambda
Sometimes when you want to do a more complex aggregation (e.g. custom code, multi-step business logic, etc.), you can use group by with a lambda function (syntax: df.groupby("group_col").apply(lambda x: custom_grouped_func(x))) to do that.
Use case: fit a simple regression model to each group, and use the model to predict the sales accordingly.
import pandas as pd
import numpy as np
import itertools
from sklearn import linear_model
np.random.seed(42)
def generate_seq(slope, intercept, size, mu=0, sigma=1):
"""
Generate (mx + b) sequence with size n plus a random gaussian noise.
"""
seq = list(range(1, size + 1, 1))
noise = np.random.normal(mu, sigma, size)
y = [intercept + slope * x for x in seq]
y = [sum(x) for x in zip(y, noise)]
return y
# Create a dummy dataframe for demonstration
df_before = pd.DataFrame({
'city':
list(itertools.chain(*[
["chicago"] * 20,
["new_york"] * 30,
["san_francisco"] * 40,
])),
'period':
list(itertools.chain(*[
list(range(1, 20 + 1, 1)),
list(range(1, 30 + 1, 1)),
list(range(1, 40 + 1, 1)),
])),
'sales':
list(
itertools.chain(*[
generate_seq(slope=1, intercept=5, size=20),
generate_seq(slope=2, intercept=8, size=30),
generate_seq(slope=0.5, intercept=2, size=40),
]))
})
def simple_group_func(gdf: pd.DataFrame):
"""
A grouped function to do a simple linear regression on the given data, return the formula, and the predicted sales.
"""
x = gdf["period"].values
y = gdf["sales"].values
length = gdf.shape[0]
X = x.reshape(length, 1)
y = y.reshape(length, 1)
lr = linear_model.LinearRegression()
lr.fit(X, y)
slope = round(lr.coef_[0][0], 2)
intercept = round(lr.intercept_[0], 2)
# Add the result back to the original dataframe with the additional columns
gdf["formula"] = f"y={slope}*x + {intercept}"
gdf["actual"] = gdf["sales"]
gdf["predict"] = lr.predict(X) # usually we use predict for unseen data, but just keeping it simple in this case
return gdf
df_after = df_before.groupby("city").apply(lambda x: simple_group_func(x)).reset_index(drop=True)
print(df_before)
print(df_after)
# df_before
city period sales
0 chicago 1 6.496714
1 chicago 2 6.861736
2 chicago 3 8.647689
3 chicago 4 10.523030
4 chicago 5 9.765847
.. ... ... ...
85 san_francisco 36 19.498243
86 san_francisco 37 21.415402
87 san_francisco 38 21.328751
88 san_francisco 39 20.970240
89 san_francisco 40 22.513267
[90 rows x 3 columns]
# df_after. With 3 additional columns from the grouped function(["formula", "actual", "predict"])
city period sales formula actual predict
0 chicago 1 6.496714 y=0.9*x + 5.87 6.496714 6.774567
1 chicago 2 6.861736 y=0.9*x + 5.87 6.861736 7.675002
2 chicago 3 8.647689 y=0.9*x + 5.87 8.647689 8.575437
3 chicago 4 10.523030 y=0.9*x + 5.87 10.523030 9.475872
4 chicago 5 9.765847 y=0.9*x + 5.87 9.765847 10.376308
.. ... ... ... ... ... ...
85 san_francisco 36 19.498243 y=0.5*x + 2.14 19.498243 19.998570
86 san_francisco 37 21.415402 y=0.5*x + 2.14 21.415402 20.494638
87 san_francisco 38 21.328751 y=0.5*x + 2.14 21.328751 20.990705
88 san_francisco 39 20.970240 y=0.5*x + 2.14 20.970240 21.486773
89 san_francisco 40 22.513267 y=0.5*x + 2.14 22.513267 21.982840
[90 rows x 6 columns]
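If you only need one row of fitted parameters per group (rather than per-row predictions), the grouped function can return a Series instead. A sketch reusing df_before and sklearn from above; fit_formula is a hypothetical helper name:
def fit_formula(gdf: pd.DataFrame) -> pd.Series:
    # Fit a per-group linear regression and return only its parameters
    lr = linear_model.LinearRegression()
    lr.fit(gdf[["period"]], gdf["sales"])
    return pd.Series({"slope": lr.coef_[0], "intercept": lr.intercept_})
df_params = df_before.groupby("city").apply(fit_formula).reset_index()
print(df_params)  # one row per city with slope and intercept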
8. Long to Wide
Just like an Excel pivot table.
Use case: Typical use case would be some Time series data. When dealing with time series data, you may have multiple observations for each time point. Converting the data from long to wide format allows you to have a single row for each time point, making it easier to perform calculations or visualize the data.
a) For Single record per group - pd.pivot
Ref: pandas.DataFrame.pivot.html
import pandas as pd
df_long = pd.DataFrame({
'group': [1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 , 3 , 3 , 3 , 3 ],
'sub_group': ["a" , "a" , "b" , "b" , "a" , "a" , "b" , "b" , "a" , "a" , "b" , "b" ],
'period': ["T1", "T2", "T1", "T2", "T1", "T2", "T1", "T2", "T1", "T2", "T1", "T2"],
'value_1': [10, 15, 20, 25, 30, 35, 1, 5, 30, 21, 32, 35 ],
'value_2': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6 ],
})
# Long to wide format using pivot
df_wide = df_long.pivot(
index=["group", "sub_group"],
columns="period",
values=["value_1", "value_2"],
)
# Flatten multi-level index dataframe
df_wide_flattened = df_wide.copy()
df_wide_flattened.columns = ["_".join(x) for x in df_wide_flattened.columns.to_flat_index()]
df_wide_flattened.reset_index(inplace=True)
print(df_long)
print(df_wide) # multi-level index
print(df_wide_flattened) # flattened level
# df_long
group sub_group period value_1 value_2
0 1 a T1 10 1
1 1 a T2 15 2
2 1 b T1 20 3
3 1 b T2 25 4
4 2 a T1 30 5
5 2 a T2 35 6
6 2 b T1 1 1
7 2 b T2 5 2
8 3 a T1 30 3
9 3 a T2 21 4
10 3 b T1 32 5
11 3 b T2 35 6
# df_wide (pivot)
value_1 value_2
period T1 T2 T1 T2
group sub_group
1 a 10 15 1 2
b 20 25 3 4
2 a 30 35 5 6
b 1 5 1 2
3 a 30 21 3 4
b 32 35 5 6
# df_wide_flattened (pivot)
group sub_group value_1_T1 value_1_T2 value_2_T1 value_2_T2
0 1 a 10 15 1 2
1 1 b 20 25 3 4
2 2 a 30 35 5 6
3 2 b 1 5 1 2
4 3 a 30 21 3 4
5 3 b 32 35 5 6
b) For Multiple records per group - pd.pivot_table with aggfunc
For a dataframe with multiple records per group, we need to use pd.pivot_table instead of pd.pivot, and provide an aggregate function as well (e.g. sum, mean, etc.). Here I deliberately use a similar dataframe (with the sub_group column removed and multiple records allowed per group), so you can easily tell the difference between the two functions.
Ref: pandas.DataFrame.pivot_table.html
import pandas as pd
df_long = pd.DataFrame({
'group': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'period': ["T1", "T1", "T2", "T2", "T1", "T1", "T2", "T2", "T1", "T1", "T1", "T2"],
'value_1': [10, 15, 20, 25, 30, 35, 1, 5, 30, 21, 32, 35],
'value_2': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
})
# Long to wide format using pivot_table
df_wide = df_long.pivot_table(
index="group",
columns="period",
values=["value_1", "value_2"],
aggfunc="sum", # Need an aggregate function to handle multiple records per group/index
)
# Flatten multi-level index dataframe
df_wide_flattened = df_wide.copy()
df_wide_flattened.columns = ["_".join(x) for x in df_wide_flattened.columns.to_flat_index()]
df_wide_flattened.reset_index(inplace=True)
print(df_long)
print(df_wide)
print(df_wide_flattened)
# df_long
group period value_1 value_2
0 1 T1 10 1
1 1 T1 15 2
2 1 T2 20 3
3 1 T2 25 4
4 2 T1 30 5
5 2 T1 35 6
6 2 T2 1 1
7 2 T2 5 2
8 3 T1 30 3
9 3 T1 21 4
10 3 T1 32 5
11 3 T2 35 6
# df_wide (pivot_table)
value_1 value_2
period T1 T2 T1 T2
group
1 25 45 3 7
2 65 6 11 3
3 83 35 12 6
# df_wide_flattened (pivot_table)
group value_1_T1 value_1_T2 value_2_T1 value_2_T2
0 1 25 45 3 7
1 2 65 6 11 3
2 3 83 35 12 6
9. Wide to Long
Just a converse of the Long to Wide transformation.
Use case: Many visualization libraries, such as seaborn or matplotlib, expect data in a long format to create meaningful plots. By converting data from wide to long format, it becomes easier to plot and visualize relationships, trends, or comparisons across different variables.
Ref: pandas.melt.html
import pandas as pd
df_wide = pd.DataFrame({
'id': [1, 2, 3],
't1': [10, 20, 30],
't2': [15, 25, 35],
't3': [12, 22, 32],
})
# Wide to long format with melt
df_long = df_wide.melt(
id_vars='id',
var_name='time',
value_name='value',
)
print(df_wide)
print(df_long)
# df_wide
id t1 t2 t3
0 1 10 15 12
1 2 20 25 22
2 3 30 35 32
# df_long
id time value
0 1 t1 10
1 2 t1 20
2 3 t1 30
3 1 t2 15
4 2 t2 25
5 3 t2 35
6 1 t3 12
7 2 t3 22
8 3 t3 32
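When the wide column names share a common stub plus a numeric suffix (t1, t2, t3 here), pd.wide_to_long is another option that parses the suffix into its own column. A sketch on the same df_wide:
df_long_alt = pd.wide_to_long(df_wide, stubnames="t", i="id", j="time").reset_index()
print(df_long_alt)  # columns: id, time, t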
10. Merge two Dataframes
The pd.merge() function is used to combine two or more DataFrames based on a common column or index. It is commonly used when you want to bring together related data from multiple sources.
Use case: Consider an example where we have two DataFrames representing orders and customers, and we want to merge them based on a shared column, such as “customer_id”
Ref: pandas.DataFrame.merge.html
import pandas as pd
df_order = pd.DataFrame({
'order_id': ["0001", "0002", "0003", "0004", "0005", "0006", "0007"],
'customer_id': [101, 101, 101, 103, 104, 105, None],
})
df_customer = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105],
'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Evan'],
})
df_merge = pd.merge(
df_order,
df_customer,
on='customer_id',
how='left', # Left join, for non-login user
)
print(df_order)
print(df_customer)
print(df_merge)
# df_order
order_id customer_id
0 0001 101
1 0002 101
2 0003 101
3 0004 103
4 0005 104
5 0006 105
6 0007 NaN
# df_customer
customer_id name
0 101 Alice
1 102 Bob
2 103 Charlie
3 104 Dave
4 105 Evan
# df_merge
order_id customer_id name
0 0001 101 Alice
1 0002 101 Alice
2 0003 101 Alice
3 0004 103 Charlie
4 0005 104 Dave
5 0006 105 Evan
6 0007 NaN NaN # Left join, for non-login user
Advanced (with different keys and added suffixes)
import pandas as pd
# Sample dataframes
df_order = pd.DataFrame({
'order_id': ["0001", "0002", "0003", "0004", "0005", "0006", "0007"],
'name': ["jan order", "feb order", "march order", "april order", "may order", "june order", "july order"],
'c_id': [101, 101, 101, 103, 104, 105, None],
})
df_customer = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105],
'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Evan'],
})
df_merge = pd.merge(
df_order,
df_customer,
left_on='c_id',
right_on='customer_id',
suffixes=('_order', '_customer'),  # As both dataframes have the same column 'name', a suffix is added to each accordingly
how='left',
)
print(df_order)
print(df_customer)
print(df_merge)
# df_order
order_id name c_id
0 0001 jan order 101
1 0002 feb order 101
2 0003 march order 101
3 0004 april order 103
4 0005 may order 104
5 0006 june order 105
6 0007 july order NaN
# df_customer
customer_id name
0 101 Alice
1 102 Bob
2 103 Charlie
3 104 Dave
4 105 Evan
# df_merge. Suffixes are added to the 'name' columns accordingly
order_id name_order c_id customer_id name_customer
0 0001 jan order 101 101 Alice
1 0002 feb order 101 101 Alice
2 0003 march order 101 101 Alice
3 0004 april order 103 103 Charlie
4 0005 may order 104 104 Dave
5 0006 june order 105 105 Evan
6 0007 july order NaN NaN NaN
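A handy companion when debugging joins is indicator=True, which adds a _merge column showing whether each row matched (both / left_only / right_only). A sketch reusing the frames above:
df_audit = pd.merge(
    df_order,
    df_customer,
    left_on='c_id',
    right_on='customer_id',
    how='left',
    indicator=True,  # adds the "_merge" column
)
print(df_audit["_merge"].value_counts())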
11. cut and qcut
Create a new categorical column by cutting numerical values into different bins.
Use case: educational grading, with an absolute grading system (binning with cut()) or a relative grading system (quantile cut with qcut()).
Ref: pandas.cut.html and pandas.qcut.html
import pandas as pd
df_before = pd.DataFrame({
'student': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve', 'Frank', 'Grace', 'Helen'],
'score': [85, 92, 78, 64, 90, 82, 75, 88],
})
# Cut the scores into bins using cut function, with manual binning
df_after = df_before.copy()
df_after['grade_bin'] = pd.cut(
df_after['score'],
bins=[0, 60, 70, 80, 90, 100],
labels=['F', 'D', 'C', 'B', 'A'],
)
# Quantile based cut with qcut
df_after['grade_quantile'] = pd.qcut(
df_after['score'],
q=5,
labels=['F', 'D', 'C', 'B', 'A'],
)
print(df_before)
print(df_after)
# df_before
student score
0 Alice 85
1 Bob 92
2 Charlie 78
3 Dave 64
4 Eve 90
5 Frank 82
6 Grace 75
7 Helen 88
# df_after
(cut) (qcut)
student score grade_bin grade_quantile
0 Alice 85 B C
1 Bob 92 A A
2 Charlie 78 C D
3 Dave 64 D F
4 Eve 90 B A
5 Frank 82 B C
6 Grace 75 C F
7 Helen 88 B B
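If you want to inspect the quantile edges that qcut picked, retbins=True also returns them, which is useful for documenting a relative grading scheme. A sketch on the same scores:
grades, bin_edges = pd.qcut(
    df_before['score'],
    q=5,
    labels=['F', 'D', 'C', 'B', 'A'],
    retbins=True,  # also return the computed quantile edges
)
print(bin_edges)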
12. Cumulative Sum, Cumulative Percentage
Cumulative sum and percentage per group.
Use case: Market share analysis: Percentage is useful for analyzing market share. By calculating the percentage of a particular category or product’s sales relative to the total sales in the market, you can assess its market dominance and track changes in market dynamics easily.
import numpy as np
import pandas as pd
import itertools
df_before = pd.DataFrame({
'city': list(itertools.chain(*[
["chicago"] * 3,
["new_york"] * 4,
["san_francisco"] * 5,
])),
'sales': list(itertools.chain(*[
np.repeat(2, 3),
np.repeat(3, 4),
np.repeat(1, 5),
]))
})
# Add cumulative sum
df_after = df_before.copy()
df_after['csum'] = df_before.groupby(['city'])['sales'].cumsum()
# Add cumulative percentage
gb_obj = df_before.groupby('city')['sales']
df_after['cpct'] = round(gb_obj.cumsum() / gb_obj.transform('sum') * 100, 1)
print(df_before)
print(df_after)
# df_before
city sales
0 chicago 2
1 chicago 2
2 chicago 2
3 new_york 3
4 new_york 3
5 new_york 3
6 new_york 3
7 san_francisco 1
8 san_francisco 1
9 san_francisco 1
10 san_francisco 1
11 san_francisco 1
# df_after
city sales csum cpct
0 chicago 2 2 33.3
1 chicago 2 4 66.7
2 chicago 2 6 100.0
3 new_york 3 3 25.0
4 new_york 3 6 50.0
5 new_york 3 9 75.0
6 new_york 3 12 100.0
7 san_francisco 1 1 20.0
8 san_francisco 1 2 40.0
9 san_francisco 1 3 60.0
10 san_francisco 1 4 80.0
11 san_francisco 1 5 100.0
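For the market-share use case, the plain (non-cumulative) share per city follows the same pattern, dividing the per-group total by the grand total. A sketch reusing df_before and df_after from above:
df_after['city_share_pct'] = round(
    df_before.groupby('city')['sales'].transform('sum') / df_before['sales'].sum() * 100, 1
)
print(df_after[['city', 'city_share_pct']].drop_duplicates())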
13. nunique and describe
Just some exploratory data analysis (EDA) stuff: print out the number of distinct values per column and the five-number summary ([0, 25, 50, 75, 100] percentiles) of the numeric columns.
import pandas as pd
# since pandas 0.19.2
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)
print(df.nunique())
print(df.describe())
# df.nunique()
sepal_length 35
sepal_width 23
petal_length 43
petal_width 22
species 3
dtype: int64
# df.describe()
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
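By default describe() only summarizes numeric columns; passing include="object" (or include="all") also covers text columns such as species:
print(df.describe(include="object"))  # count, unique, top, freq for the species column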
14. Crosstab
The crosstab function (syntax: pd.crosstab(df["col_a"], df["col_b"])) is typically used when we are analyzing the relationship between two or more categorical variables. It creates a cross-tabulation or contingency table that summarizes the frequency or occurrence of different categories across multiple variables.
import pandas as pd
df_before = pd.DataFrame({
'gender': ['male', 'male', 'female', 'female', 'male', 'female', 'male', 'female'],
'education': ['high school', 'college', 'college', 'graduate', 'high school', 'graduate', 'college', 'graduate'],
'score': [1, 5, 6, 8, 3, 9, 10, 3]
})
df_after = pd.crosstab(df_before['gender'], df_before['education']).reset_index()
print(df_before)
print(df_after)
# df_before
gender education score
0 male high school 1
1 male college 5
2 female college 6
3 female graduate 8
4 male high school 3
5 female graduate 9
6 male college 10
7 female graduate 3
# df_after
education gender college graduate high school
0 female 1 3 0
1 male 2 0 2
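crosstab can also normalize the counts into proportions (per row, per column, or overall), which often reads better than raw frequencies. A sketch on the same data:
df_pct = pd.crosstab(
    df_before['gender'],
    df_before['education'],
    normalize='index',  # each row sums to 1.0
).round(2).reset_index()
print(df_pct)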
15. Remap Column Value with map
Map values in the columns according to an input dictionary.
import pandas as pd
df_before = pd.DataFrame({
'id': [1, 2, 3, 4, 5, 26 ],
'fruit': ["a", "b", "c", "d", "e", "z"],
})
m_fruit = {
"a": "apple",
"b": "banana",
"c": "coconut",
"d": "durian",
"e": "elderberry",
}
# Replace column value with the mapping dictionary, and use fillna as default value
df_after = df_before.copy()
df_after["fruit"] = df_after["fruit"].map(m_fruit).fillna("zucchini")
print(df_before)
print(df_after)
# df_before
id fruit
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 26 z
# df_after
id fruit
0 1 apple
1 2 banana
2 3 coconut
3 4 durian
4 5 elderberry
5 26 zucchini
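If you want unmapped values to pass through unchanged instead of falling back to a single default, Series.replace behaves that way (keys not found in the dictionary are kept as-is). A sketch on the same data:
df_alt = df_before.copy()
df_alt["fruit"] = df_alt["fruit"].replace(m_fruit)  # "z" stays as "z"
print(df_alt)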
16. Percentage of non-zero and na Values
Check the percentage of missing values in a DataFrame.
Ref: Percentage of missing values
import pandas as pd
import numpy as np
df_before = pd.DataFrame({
'col_1': [1, 2, 3, 0, 5, np.nan],
'col_2': ["a", "b", "c", None, "e", None],
'col_3': ["x", "x", "x", "x", "x", "x"],
})
# Method 1
df_after = ((df_before.isnull() | df_before.isna()).sum() / df_before.index.size) \
.round(2) \
.to_frame()
print(df_before)
print(df_after)
# Method 2
print(df_before.isnull().mean().round(2))
# df_before
col_1 col_2 col_3
0 1.0 a x
1 2.0 b x
2 3.0 c x
3 0.0 None x
4 5.0 e x
5 NaN None x
# df_after
0
col_1 0.17
col_2 0.33
col_3 0.00
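The same one-liner pattern works for zero values in the numeric columns, which is often checked alongside missingness (and is what the "non-zero" part of this section's title refers to). A sketch on df_before:
# Percentage of zero values per numeric column
print((df_before.select_dtypes("number") == 0).mean().round(2))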
17. Find a Specific Table from a Webpage
This one came to me during research: pandas offers a simple way to read HTML tables directly with the pd.read_html function.
# Ref: https://www.geeksforgeeks.org/scraping-wikipedia-table-with-pandas-using-read_html
import pandas as pd
# Remarks: need lxml package
df = pd.read_html('https://en.wikipedia.org/wiki/Demographics_of_India',
match='Population distribution by states/union territories',
)
print(df[0].head())
Rank State/UT Population[59] ... Urban[60] Area[61] (km2) Density (per km2)
0 1 Uttar Pradesh 199812341 ... 44470455 240928 828
1 2 Maharashtra 112374333 ... 50827531 307713 365
2 3 Bihar 104099452 ... 11729609 94163 1102
3 4 West Bengal 91276115 ... 29134060 88752 1030
4 5 Madhya Pradesh 72626809 ... 20059666 308245 236
18. Sampling
Sampling is a common technique used in data analysis and statistics to make inferences about a larger population based on a subset of the data. It is also useful when you want to draw a representative sample of the data to do some quick analysis and make inferences about the larger population.
Ref: pandas.DataFrame.sample.html
import pandas as pd
import numpy as np
np.random.seed(123)
# Generate dummy dataframe with Gaussian distribution
mu, sigma, size = 0, 1, 100
dummy_data = np.random.normal(loc=mu, scale=sigma, size=size)
df_before = pd.DataFrame({'value': dummy_data})
# Sample without replacement
df_after = df_before.sample(
n=10,
replace=False,
).reset_index(drop=True)
print(df_before)
print(df_after)
# df_before
value
0 -1.085631
1 0.997345
2 0.282978
3 -1.506295
4 -0.578600
.. ...
95 1.031114
96 -1.084568
97 -1.363472
98 0.379401
99 -0.379176
# df_after (Sampling without replacement)
value
0 -0.443982
1 0.737369
2 -1.085631
3 0.283627
4 -1.605963
5 -1.253881
6 -0.638902
7 1.491390
8 1.177862
9 1.031114
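Two common variations worth knowing: sampling a fraction of the rows instead of a fixed count, and fixing random_state for reproducibility. A sketch reusing df_before:
# Sample 10% of the rows, reproducibly
df_frac = df_before.sample(frac=0.1, random_state=42).reset_index(drop=True)
print(df_frac)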
Final Thoughts
The post turned out longer than I thought. I hope the examples make your life as a data scientist easier and that you picked up some new skills along the way.
Happy Coding!