Hello! My name is Dmytro Samchuk, I’m a Machine Learning Engineer at Levi9, and I develop projects based on ML and Data Science technologies.
During my five years in commercial development, I have worked with various data analysis libraries, including Pandas. In this article, I share the top 10 things I find useful to know when working with this library. I relied solely on my own experience, so some observations matter more than others, and you may already know some of them. Still, I hope every reader finds at least one discovery here.
Categorical data type
By default, columns with a limited number of distinct values (categories) are read as the object dtype. From a memory-usage point of view this is far from ideal: what we really want is an index of the unique values, with each cell holding only a reference (code) to one of them instead of the full value.
Fortunately, pandas has a dedicated dtype for exactly this: CategoricalDtype.
Let’s look at an example. Recently, I was preparing a dataset of triplets for metric learning. For internal reasons, I built a large data frame of image paths with three columns per row: anchor, positive, and negative.
The input data frame looks like this:
+----------+------------------------+
| class | filename |
+----------+------------------------+
| Bathroom | Bathroom\bath_1.jpg |
| Bathroom | Bathroom\bath_100.jpg |
| Bathroom | Bathroom\bath_1003.jpg |
| Bathroom | Bathroom\bath_1004.jpg |
| Bathroom | Bathroom\bath_1005.jpg |
+----------+------------------------+
The desired view:
+------------------------+------------------------+----------------------------+
| anchor | positive | negative |
+------------------------+------------------------+----------------------------+
| Bathroom\bath_1.jpg | Bathroom\bath_100.jpg | Dinning\din_540.jpg |
| Bathroom\bath_100.jpg | Bathroom\bath_1003.jpg | Dinning\din_1593.jpg |
| Bathroom\bath_1003.jpg | Bathroom\bath_1004.jpg | Bedroom\bed_329.jpg |
| Bathroom\bath_1004.jpg | Bathroom\bath_1005.jpg | Livingroom\living_1030.jpg |
| Bathroom\bath_1005.jpg | Bathroom\bath_1007.jpg | Bedroom\bed_1240.jpg |
+------------------------+------------------------+----------------------------+
The values from the `filename` column end up duplicated many times across these three columns, so it makes sense to store them as categorical.
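A minimal sketch of how such a conversion usually looks (the column names come from the tables above; the CSV file name is hypothetical):
# a sketch: convert the existing string columns to the categorical dtype
for col in ["anchor", "positive", "negative"]:
    triplets[col] = triplets[col].astype("category")

# or declare the dtype up front when reading the raw data
triplets = pd.read_csv("triplets.csv", dtype="category")  # hypothetical file name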
For example, let’s load an already converted dataset and see the difference in memory (with and without using a categorical type):
# with categories
triplets.info(memory_usage="deep")
#  #   Column    Non-Null Count   Dtype
# ---  ------    --------------   -----
#  0   anchor    525000 non-null  category
#  1   positive  525000 non-null  category
#  2   negative  525000 non-null  category
# dtypes: category(3)
# memory usage: 4.6 MB

# without categories
triplets_raw.info(memory_usage="deep")
#  #   Column    Non-Null Count   Dtype
# ---  ------    --------------   -----
#  0   anchor    525000 non-null  object
#  1   positive  525000 non-null  object
#  2   negative  525000 non-null  object
# dtypes: object(3)
# memory usage: 118.1 MB
As you can see, the difference is significant, and it grows further as the number of duplicated values increases.
Expanding columns into rows
Let’s take a look at a dataset from a Kaggle competition, namely the file ‘census_starter.csv’, which looks like this:
+-------+-------------+-------------+-------------+-------------+-------------+
| cfips | pct_bb_2017 | pct_bb_2018 | pct_bb_2019 | pct_bb_2020 | pct_bb_2021 |
+-------+-------------+-------------+-------------+-------------+-------------+
| 1001 | 76.6 | 78.9 | 80.6 | 82.7 | 85.5 |
| 1003 | 74.5 | 78.1 | 81.8 | 85.1 | 87.9 |
| 1005 | 57.2 | 60.4 | 60.5 | 64.6 | 64.6 |
| 1007 | 62.0 | 66.1 | 69.2 | 76.1 | 74.6 |
| 1009 | 65.8 | 68.5 | 73.0 | 79.6 | 81.0 |
| 1011 | 49.4 | 58.9 | 60.1 | 60.6 | 59.4 |
| 1013 | 58.2 | 62.1 | 64.6 | 73.6 | 76.3 |
| 1015 | 71.0 | 73.0 | 75.1 | 79.8 | 81.6 |
| 1017 | 62.8 | 66.5 | 69.4 | 74.5 | 77.1 |
| 1019 | 67.5 | 68.6 | 70.7 | 75.0 | 76.7 |
| 1021 | 56.6 | 61.8 | 68.7 | 74.6 | 78.3 |
+-------+-------------+-------------+-------------+-------------+-------------+
Columns laid out this way are awkward to work with: it would be much better to have two columns, “year” and “pct_bb”, and five rows per cfips with the corresponding values, as shown in the table below:
+-------+------+--------+
| cfips | year | pct_bb |
+-------+------+--------+
| 1001 | 2017 | 76.6 |
| 1001 | 2018 | 78.9 |
| 1001 | 2019 | 80.6 |
| 1001 | 2020 | 82.7 |
| 1001 | 2021 | 85.5 |
| 1003 | 2017 | 74.5 |
| 1003 | 2018 | 78.1 |
| 1003 | 2019 | 81.8 |
| 1003 | 2020 | 85.1 |
| 1003 | 2021 | 87.9 |
+-------+------+--------+
This result can be achieved in several ways, but the most commonly used method is melt().
feature = "pct_bb"
cols = sorted([col for col in original_df.columns
               if col.startswith(feature)])

df = original_df[["cfips"] + cols]
df = df.melt(id_vars="cfips",
             value_vars=cols,
             var_name="year",
             value_name=feature).sort_values(by=["cfips", "year"])

# the "year" column still contains values like "pct_bb_2017",
# so keep only the trailing year
df["year"] = df["year"].str[-4:].astype(int)
Similar functionality is also achieved using wide_to_long().
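A minimal wide_to_long() sketch for the same reshaping, reusing original_df and cols from the snippet above (it assumes cfips uniquely identifies each row):
long_df = pd.wide_to_long(original_df[["cfips"] + cols],
                          stubnames="pct_bb",
                          i="cfips",
                          j="year",
                          sep="_").reset_index()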
apply() is slow, but sometimes there is no way around it
You probably already know that it is better to avoid this path, because apply() iterates over every row and calls the given function on each one. But often there is no alternative; in that case, you can use packages like swifter or pandarallel, which parallelize the process.
""" Swifter way """
import pandas as pd
import swifter
def target_function(row):
return row * 10
def traditional_way(data):
data['out'] = data['in'].apply(target_function)
def swifter_way(data):
data['out'] = data['in'].swifter.apply(target_function)
""" Pandarallel way """
import pandas as pd
from pandarallel import pandarallel
def target_function(row):
return row * 10
def traditional_way(data):
data['out'] = data['in'].apply(target_function)
def pandarallel_way(data):
pandarallel.initialize()
data['out'] = data['in'].parallel_apply(target_function)
Of course, if you have a cluster and are interested in distributed computing of large datasets, it is better to use dask or pyspark.
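For completeness, here is a minimal dask sketch under the same assumptions as above (a frame with an 'in' column; the file pattern is hypothetical):
import dask.dataframe as dd

def target_function(row):
    return row * 10

ddf = dd.read_csv("big_dataset_*.csv")   # lazily reads the partitions
ddf["out"] = ddf["in"].map_partitions(lambda s: s.apply(target_function))
result = ddf.compute()                   # triggers the actual computation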
Empty values, int, Int64
Here’s a short note: the standard int dtype cannot hold empty values. Be careful, because by default such a column is silently converted to float.
If your schema allows empty values in an integer field, use Int64. In that case, instead of a conversion to float, you get proper integers plus pd.NA for the missing entries.
I once had a column that had been read as int for years (it never contained empty values), until one day, due to a backend bug, empty values suddenly appeared in it.
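A minimal illustration of the difference (the values are made up):
import pandas as pd

s_default = pd.Series([1, 2, None])                  # silently becomes float64
s_nullable = pd.Series([1, 2, None], dtype="Int64")  # stays integer, the gap is pd.NA
print(s_default.dtype, s_nullable.dtype)             # float64 Int64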
Should I save it to CSV or Parquet?
The answer to this question is unequivocal: if nothing prevents you, use parquet. It preserves data types, so you no longer need to pass dtypes when reading.
In addition, the file takes up very little disk space. For example, I saved the same dataset in csv (no compression), gzip, zip, and parquet formats. This matters especially when you keep large datasets in the cloud and pay for the amount of disk space they consume.
+------------------------+---------+
| file | size |
+------------------------+---------+
| triplets_525k.csv | 38.4 MB |
| triplets_525k.csv.gzip | 4.3 MB |
| triplets_525k.csv.zip | 4.5 MB |
| triplets_525k.parquet | 1.9 MB |
+------------------------+---------+
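A sketch of how such a comparison can be produced, reusing the triplets frame from the first section (the file names are illustrative):
triplets.to_csv("triplets_525k.csv", index=False)
triplets.to_csv("triplets_525k.csv.gzip", index=False, compression="gzip")
triplets.to_csv("triplets_525k.csv.zip", index=False, compression="zip")
triplets.to_parquet("triplets_525k.parquet", index=False)

# reading back preserves the dtypes, including category
triplets = pd.read_parquet("triplets_525k.parquet")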
However, you will need an additional package such as pyarrow or fastparquet to read it. Also, you cannot simply pass an nrows argument to read only the first N rows; that takes slightly more work, for example as shown below.
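One possible approach with pyarrow (a sketch; the file name comes from the table above, and the batch size is arbitrary):
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("triplets_525k.parquet")
first_batch = next(parquet_file.iter_batches(batch_size=1000))  # roughly the first 1000 rows
df_head = first_batch.to_pandas()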
At the same time, I should note that I do not know all the limitations of parquet files, so please share your thoughts in the comments.
Relative value_counts()
For a while, I computed relative frequencies the long way: explicitly grouping, counting, and dividing by the total.
However, if you look at the arguments of value_counts(), you can do it much more simply, and also decide whether to take empty values into account.
df = pd.DataFrame({"a": [1, 2, None], "b": [4., 5.1, 14.02]})
df["a"] = df["a"].astype("Int64")
print(df.info())
print(df["a"].value_counts(normalize=True, dropna=False),
      df["a"].value_counts(normalize=True, dropna=True), sep="\n\n")
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   a       2 non-null      Int64
#  1   b       3 non-null      float64
#
# 1       0.333333
# 2       0.333333
# <NA>    0.333333
# Name: a, dtype: Float64
#
# 1    0.5
# 2    0.5
# Name: a, dtype: Float64
Modin for experiments
Perhaps this item should have come first. The library remains popular mainly because it requires changing only one line in your script or notebook: you import modin.pandas instead of pandas.
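The one-line change in practice (a sketch; the file name is hypothetical, and Modin picks a Ray or Dask engine under the hood):
import modin.pandas as pd   # instead of `import pandas as pd`

df = pd.read_csv("big_dataset.csv")   # the rest of the pandas API stays the same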
I don’t currently use this solution in production, because there we already have pyspark, but for local experiments and exploratory data analysis I think it’s an excellent option.
Here is the documentation on how to use it.
Named groups and the extract method
You often come across complex but structured strings from which you need to extract parts into separate columns.
Let’s take a look at an example with addresses:
addresses_structured = pd.Series([
    "Україна, Полтавська обл, м. Полтава, вул. Полтавська, 0",
    "Україна, Київська обл, м. Вишневе, вул. Вишнівська, 1",
    "Україна, Львівська обл, м. Львів, вул. Львівська, 999"])

regex_structured = (
    r"(?P<country>[А-Яа-яїі]+),\s"
    r"(?P<region>[А-Яа-яїі\']+(?:\sобл\.*)),\s"
    r"(?P<city>(?:[смт]+\.\s*)[А-Яа-яїі\']+),\s"
    r"(?P<street>(?:вул+\.\s*)[А-Яа-яїі\']+),\s"
    r"(?P<street_num>\d+)")

addresses_structured.str.extract(regex_structured)
We have the following result:
+---------+----------------+------------+-----------------+------------+
| country | region | city | street | street_num |
+---------+----------------+------------+-----------------+------------+
| Ukraine | Poltavska | Poltava | Poltavska | 0 |
| Ukraine | Kyivska | Vyshneve | Vyshnivska | 1 |
| Ukraine | Lvivska | Lviv | Lvivska | 999 |
+---------+----------------+------------+-----------------+------------+
Reading from the clipboard
The next trick is not so much useful as it is fun. I’ve used it a few times to pull a table from a PDF file into my analysis. The typical solution would be to copy everything you need, paste it into a spreadsheet editor, export it to CSV, and read it with Pandas.
However, there is a more straightforward option, namely pd.read_clipboard(). Just copy the data you need and run it, like this:
df_cb = pd.read_clipboard(header=None)
df_cb.head()
If the table has grouped headers, you can read the clipboard into several variables, piece by piece, and then combine them.
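An interactive sketch of that idea; each pd.read_clipboard() call parses whatever was copied last:
# copy the first block of the table, then:
part_one = pd.read_clipboard()
# copy the second block, then:
part_two = pd.read_clipboard()
# combine the pieces back into one frame
combined = pd.concat([part_one, part_two], axis=1)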
I recommend keeping in mind that this option exists. You can also export to the clipboard using the to_clipboard() method; this is how I inserted the tables into this article.
Another way to "unpack" tags in a multi-label style
Let’s say we have a dataset like this, which is quite typical:
+---+---+----------------+
| a | b | cat |
+---+---+----------------+
| 1 | 4 | ['foo', 'bar'] |
| 2 | 5 | ['foo'] |
| 3 | 6 | ['qux'] |
+---+---+----------------+
To train a model or analyze data, we need to make something like this out of it:
+---+---+---------+---------+---------+
| a | b | cat_bar | cat_foo | cat_qux |
+---+---+---------+---------+---------+
| 1 | 4 | 1 | 1 | 0 |
| 2 | 5 | 0 | 1 | 0 |
| 3 | 6 | 0 | 0 | 1 |
+---+---+---------+---------+---------+
I know three different ways to achieve this result:
- Iteratively building a pd.Series from each array
- A vectorized approach via explode() and get_dummies()
- sklearn.preprocessing.MultiLabelBinarizer()
Let's look at all three and see how fast they run.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [4, 5, 6],
                   "category": [["foo", "bar"], ["foo"], ["qux"]]})
# increase the number of rows in the dataframe
df = pd.concat([df] * 10000, ignore_index=True)

def dummies_series_apply(df):
    return (df.join(df['category'].apply(pd.Series)
                    .stack()
                    .str.get_dummies()
                    .groupby(level=0)
                    .sum())
              .drop("category", axis=1))

def dummies_vectorized(df):
    return (pd.get_dummies(df.explode("category"), prefix="cat")
              .groupby(["a", "b"])
              .sum()
              .reset_index())

def sklearn_mlb(df):
    mlb = MultiLabelBinarizer()
    return (df.join(pd.DataFrame(mlb.fit_transform(df['category']),
                                 columns=mlb.classes_))
              .drop("category", axis=1))
%timeit dummies_series_apply(df.copy())
# 3.27 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dummies_vectorized(df.copy())
# 13.3 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sklearn_mlb(df.copy())
# 16 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, the first method, which comes up often in StackOverflow answers, is by far the slowest; in this case, the notebook did not even time it for more than one loop. The other two options are perfectly acceptable.
Conclusion
I hope every reader has found something new in this selection. Perhaps the most important takeaway is to use apply() less and to prefer vectorized operations.
Also, remember that more interesting alternatives to CSV exist for storing datasets, and don't forget about the categorical type, because it can save you a lot of memory.