Hello! My name is Dmytro Samchuk, I’m a Machine Learning Engineer at Levi9, and I develop projects based on ML and Data Science technologies.
During my five years in commercial development, I have worked with various data analysis libraries, including Pandas. In this article, I share the top 10 things I find useful to know when working with this library. I relied solely on my own experience, so some observations matter more than others, and you may already know some of them. Still, I hope every reader finds at least one discovery here.
Categorical data type
By default, columns with a limited number of distinct values (categories) are read as the object dtype. From a memory-usage point of view this is far from ideal: what we really want is an index of the unique values, with each cell holding only a reference (code) to one of them instead of the full value.
Fortunately, pandas has a dedicated dtype for exactly this: CategoricalDtype.
Let’s look at an example. Recently, I was preparing a dataset of triplets for metric learning. For internal reasons, I built a large data frame of image paths with three columns per row: anchor, positive, and negative.
The input data frame looks like this:
+----------+------------------------+
| class | filename |
+----------+------------------------+
| Bathroom | Bathroom\bath_1.jpg |
| Bathroom | Bathroom\bath_100.jpg |
| Bathroom | Bathroom\bath_1003.jpg |
| Bathroom | Bathroom\bath_1004.jpg |
| Bathroom | Bathroom\bath_1005.jpg |
+----------+------------------------+
The desired view:
+------------------------+------------------------+----------------------------+
| anchor | positive | negative |
+------------------------+------------------------+----------------------------+
| Bathroom\bath_1.jpg | Bathroom\bath_100.jpg | Dinning\din_540.jpg |
| Bathroom\bath_100.jpg | Bathroom\bath_1003.jpg | Dinning\din_1593.jpg |
| Bathroom\bath_1003.jpg | Bathroom\bath_1004.jpg | Bedroom\bed_329.jpg |
| Bathroom\bath_1004.jpg | Bathroom\bath_1005.jpg | Livingroom\living_1030.jpg |
| Bathroom\bath_1005.jpg | Bathroom\bath_1007.jpg | Bedroom\bed_1240.jpg |
+------------------------+------------------------+----------------------------+
The values from the `filename` column end up duplicated many times across these three columns, so it makes sense to store them as categorical.
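A minimal sketch of how such a conversion usually looks (the column names come from the tables above; the CSV file name is hypothetical):
# a sketch: convert the existing string columns to the categorical dtype
for col in ["anchor", "positive", "negative"]:
    triplets[col] = triplets[col].astype("category")

# or declare the dtype up front when reading the raw data
triplets = pd.read_csv("triplets.csv", dtype="category")  # hypothetical file name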
For example, let’s load an already converted dataset and see the difference in memory (with and without using a categorical type):
# with categories
triplets.info(memory_usage="deep")
#  #   Column    Non-Null Count   Dtype
# ---  ------    --------------   -----
#  0   anchor    525000 non-null  category
#  1   positive  525000 non-null  category
#  2   negative  525000 non-null  category
# dtypes: category(3)
# memory usage: 4.6 MB

# without categories
triplets_raw.info(memory_usage="deep")
#  #   Column    Non-Null Count   Dtype
# ---  ------    --------------   -----
#  0   anchor    525000 non-null  object
#  1   positive  525000 non-null  object
#  2   negative  525000 non-null  object
# dtypes: object(3)
# memory usage: 118.1 MB
As you can see, the difference is significant, and it grows further as the number of duplicated values increases.
Expanding columns into rows
Let’s take a look at a dataset from a Kaggle competition, namely the file ‘census_starter.csv’, which looks like this:
+-------+-------------+-------------+-------------+-------------+-------------+
| cfips | pct_bb_2017 | pct_bb_2018 | pct_bb_2019 | pct_bb_2020 | pct_bb_2021 |
+-------+-------------+-------------+-------------+-------------+-------------+
| 1001 | 76.6 | 78.9 | 80.6 | 82.7 | 85.5 |
| 1003 | 74.5 | 78.1 | 81.8 | 85.1 | 87.9 |
| 1005 | 57.2 | 60.4 | 60.5 | 64.6 | 64.6 |
| 1007 | 62.0 | 66.1 | 69.2 | 76.1 | 74.6 |
| 1009 | 65.8 | 68.5 | 73.0 | 79.6 | 81.0 |
| 1011 | 49.4 | 58.9 | 60.1 | 60.6 | 59.4 |
| 1013 | 58.2 | 62.1 | 64.6 | 73.6 | 76.3 |
| 1015 | 71.0 | 73.0 | 75.1 | 79.8 | 81.6 |
| 1017 | 62.8 | 66.5 | 69.4 | 74.5 | 77.1 |
| 1019 | 67.5 | 68.6 | 70.7 | 75.0 | 76.7 |
| 1021 | 56.6 | 61.8 | 68.7 | 74.6 | 78.3 |
+-------+-------------+-------------+-------------+-------------+-------------+
Columns laid out this way are awkward to work with: it would be much better to have two columns, “year” and “pct_bb”, and five rows per cfips with the corresponding values, as shown in the table below:
+-------+------+--------+
| cfips | year | pct_bb |
+-------+------+--------+
| 1001 | 2017 | 76.6 |
| 1001 | 2018 | 78.9 |
| 1001 | 2019 | 80.6 |
| 1001 | 2020 | 82.7 |
| 1001 | 2021 | 85.5 |
| 1003 | 2017 | 74.5 |
| 1003 | 2018 | 78.1 |
| 1003 | 2019 | 81.8 |
| 1003 | 2020 | 85.1 |
| 1003 | 2021 | 87.9 |
+-------+------+--------+
This result can be achieved in several ways, but the most commonly used method is melt().
feature = "pct_bb"
cols = sorted([col for col in original_df.columns
               if col.startswith(feature)])

df = original_df[["cfips"] + cols]
df = df.melt(id_vars="cfips",
             value_vars=cols,
             var_name="year",
             value_name=feature).sort_values(by=["cfips", "year"])

# the "year" column still contains values like "pct_bb_2017",
# so keep only the trailing year
df["year"] = df["year"].str[-4:].astype(int)
Similar functionality is also achieved using wide_to_long().
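A minimal wide_to_long() sketch for the same reshaping, reusing original_df and cols from the snippet above (it assumes cfips uniquely identifies each row):
long_df = pd.wide_to_long(original_df[["cfips"] + cols],
                          stubnames="pct_bb",
                          i="cfips",
                          j="year",
                          sep="_").reset_index()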
apply() is slow, but sometimes there is no way around it
You probably already know that it is better to avoid this path, because apply() iterates over every row and calls the given function on each one. But often there is no alternative; in that case, you can use packages like swifter or pandarallel, which parallelize the process.
""" Swifter way """
import pandas as pd
import swifter
def target_function(row):
return row * 10
def traditional_way(data):
data['out'] = data['in'].apply(target_function)
def swifter_way(data):
data['out'] = data['in'].swifter.apply(target_function)
""" Pandarallel way """
import pandas as pd
from pandarallel import pandarallel
def target_function(row):
return row * 10
def traditional_way(data):
data['out'] = data['in'].apply(target_function)
def pandarallel_way(data):
pandarallel.initialize()
data['out'] = data['in'].parallel_apply(target_function)
Of course, if you have a cluster and are interested in distributed computing of large datasets, it is better to use dask or pyspark.
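For completeness, here is a minimal dask sketch under the same assumptions as above (a frame with an 'in' column; the file pattern is hypothetical):
import dask.dataframe as dd

def target_function(row):
    return row * 10

ddf = dd.read_csv("big_dataset_*.csv")   # lazily reads the partitions
ddf["out"] = ddf["in"].map_partitions(lambda s: s.apply(target_function))
result = ddf.compute()                   # triggers the actual computation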
Empty values, int, Int64
Here’s a short note: the standard int dtype cannot hold empty values. Be careful, because by default such a column is silently converted to float.
If your schema allows empty values in an integer field, use Int64. In that case, instead of a conversion to float, you get proper integers plus pd.NA for the missing entries.
I once had a column that had been read as int for years (it never contained empty values), until one day, due to a backend bug, empty values suddenly appeared in it.
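A minimal illustration of the difference (the values are made up):
import pandas as pd

s_default = pd.Series([1, 2, None])                  # silently becomes float64
s_nullable = pd.Series([1, 2, None], dtype="Int64")  # stays integer, the gap is pd.NA
print(s_default.dtype, s_nullable.dtype)             # float64 Int64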
Should I save it to CSV or Parquet?
The answer to this question is unequivocal: if nothing prevents you, use parquet. It preserves data types, so you no longer need to pass dtypes when reading.
In addition, the file takes up very little disk space. For example, I saved the same dataset in csv (no compression), gzip, zip, and parquet formats. This matters especially when you keep large datasets in the cloud and pay for the amount of disk space they consume.
+------------------------+---------+
| file | size |
+------------------------+---------+
| triplets_525k.csv | 38.4 MB |
| triplets_525k.csv.gzip | 4.3 MB |
| triplets_525k.csv.zip | 4.5 MB |
| triplets_525k.parquet | 1.9 MB |
+------------------------+---------+
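A sketch of how such a comparison can be produced, reusing the triplets frame from the first section (the file names are illustrative):
triplets.to_csv("triplets_525k.csv", index=False)
triplets.to_csv("triplets_525k.csv.gzip", index=False, compression="gzip")
triplets.to_csv("triplets_525k.csv.zip", index=False, compression="zip")
triplets.to_parquet("triplets_525k.parquet", index=False)

# reading back preserves the dtypes, including category
triplets = pd.read_parquet("triplets_525k.parquet")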
However, you will need an additional package such as pyarrow or fastparquet to read it. Also, you cannot simply pass an nrows argument to read only the first N rows; that takes slightly more work, for example as shown below.
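One possible approach with pyarrow (a sketch; the file name comes from the table above, and the batch size is arbitrary):
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("triplets_525k.parquet")
first_batch = next(parquet_file.iter_batches(batch_size=1000))  # roughly the first 1000 rows
df_head = first_batch.to_pandas()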
At the same time, I should note that I do not know all the limitations of parquet files, so please share your thoughts in the comments.
Relative value_counts()
For a while, I computed relative frequencies the long way: explicitly grouping, counting, and dividing by the total.
However, if you look at the arguments of value_counts(), you can do it much more simply, and also decide whether to take empty values into account.
df = pd.DataFrame({"a": [1, 2, None], "b": [4., 5.1, 14.02]})
df["a"] = df["a"].astype("Int64")
print(df.info())
print(df["a"].value_counts(normalize=True, dropna=False),
      df["a"].value_counts(normalize=True, dropna=True), sep="\n\n")
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   a       2 non-null      Int64
#  1   b       3 non-null      float64
#
# 1       0.333333
# 2       0.333333
# <NA>    0.333333
# Name: a, dtype: Float64
#
# 1    0.5
# 2    0.5
# Name: a, dtype: Float64
Modin for experiments
Perhaps this item should have come first. The library remains popular mainly because it requires changing only one line in your script or notebook: you import modin.pandas instead of pandas.
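The one-line change in practice (a sketch; the file name is hypothetical, and Modin picks a Ray or Dask engine under the hood):
import modin.pandas as pd   # instead of `import pandas as pd`

df = pd.read_csv("big_dataset.csv")   # the rest of the pandas API stays the same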
I don’t currently use this solution in production, because there we already have pyspark, but for local experiments and exploratory data analysis I think it’s an excellent option.
Here is the documentation on how to use it.
Named groups and the extract method
You often come across complex but structured strings from which you need to extract parts into separate columns.
Let’s take a look at an example with addresses:
addresses_structured = pd.Series([
    "Україна, Полтавська обл, м. Полтава, вул. Полтавська, 0",
    "Україна, Київська обл, м. Вишневе, вул. Вишнівська, 1",
    "Україна, Львівська обл, м. Львів, вул. Львівська, 999"])

regex_structured = (
    r"(?P<country>[А-Яа-яїі]+),\s"
    r"(?P<region>[А-Яа-яїі\']+(?:\sобл\.*)),\s"
    r"(?P<city>(?:[смт]+\.\s*)[А-Яа-яїі\']+),\s"
    r"(?P<street>(?:вул+\.\s*)[А-Яа-яїі\']+),\s"
    r"(?P<street_num>\d+)")

addresses_structured.str.extract(regex_structured)
We have the following result:
+---------+----------------+------------+-----------------+------------+
| country | region | city | street | street_num |
+---------+----------------+------------+-----------------+------------+
| Ukraine | Poltavska | Poltava | Poltavska | 0 |
| Ukraine | Kyivska | Vyshneve | Vyshnivska | 1 |
| Ukraine | Lvivska | Lviv | Lvivska | 999 |
+---------+----------------+------------+-----------------+------------+
Reading from the clipboard
The next trick is not so much useful as it is fun. I’ve used it a few times to pull a table from a PDF file into my analysis. The typical solution would be to copy everything you need, paste it into a spreadsheet editor, export it to CSV, and read it with Pandas.
However, there is a more straightforward option, namely pd.read_clipboard(). Just copy the data you need and run it, like this:
df_cb = pd.read_clipboard(header=None)
df_cb.head()
If the table has grouped headers, you can read the clipboard into several variables, piece by piece, and then combine them.
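An interactive sketch of that idea; each pd.read_clipboard() call parses whatever was copied last:
# copy the first block of the table, then:
part_one = pd.read_clipboard()
# copy the second block, then:
part_two = pd.read_clipboard()
# combine the pieces back into one frame
combined = pd.concat([part_one, part_two], axis=1)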
I recommend keeping in mind that this option exists. You can also export to the clipboard using the to_clipboard() method; this is how I inserted the tables into this article.
Another way to "unpack" tags in a multi-label style
Let’s say we have a dataset like this, which is quite typical:
+---+---+----------------+
| a | b | cat |
+---+---+----------------+
| 1 | 4 | ['foo', 'bar'] |
| 2 | 5 | ['foo'] |
| 3 | 6 | ['qux'] |
+---+---+----------------+
To train a model or analyze data, we need to make something like this out of it:
+---+---+---------+---------+---------+
| a | b | cat_bar | cat_foo | cat_qux |
+---+---+---------+---------+---------+
| 1 | 4 | 1 | 1 | 0 |
| 2 | 5 | 0 | 1 | 0 |
| 3 | 6 | 0 | 0 | 1 |
+---+---+---------+---------+---------+
I know three different ways to achieve this result:
- Iteratively building a pd.Series from each array
- A vectorized approach via explode() and get_dummies()
- sklearn.preprocessing.MultiLabelBinarizer()
Let's look at all three and see how fast they run.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [4, 5, 6],
                   "category": [["foo", "bar"], ["foo"], ["qux"]]})
# increase the number of rows in the dataframe
df = pd.concat([df] * 10000, ignore_index=True)

def dummies_series_apply(df):
    return (df.join(df['category'].apply(pd.Series)
                    .stack()
                    .str.get_dummies()
                    .groupby(level=0)
                    .sum())
              .drop("category", axis=1))

def dummies_vectorized(df):
    return (pd.get_dummies(df.explode("category"), prefix="cat")
              .groupby(["a", "b"])
              .sum()
              .reset_index())

def sklearn_mlb(df):
    mlb = MultiLabelBinarizer()
    return (df.join(pd.DataFrame(mlb.fit_transform(df['category']),
                                 columns=mlb.classes_))
              .drop("category", axis=1))
%timeit dummies_series_apply(df.copy())
# 3.27 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dummies_vectorized(df.copy())
# 13.3 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sklearn_mlb(df.copy())
# 16 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, the first method, which comes up often in StackOverflow answers, is by far the slowest; in this case, the notebook did not even time it for more than one loop. The other two options are perfectly acceptable.
Conclusion
I hope every reader has found something new in this selection. Perhaps the most important takeaway is to use apply() less and to prefer vectorized operations.
Also, remember that more interesting alternatives to CSV exist for storing datasets, and don't forget about the categorical type, because it can save you a lot of memory.