Machine Learning Needs More Generators
I’ve spent the last two days cleaning up some research machine learning code that blew up due to memory constraints when I tried applying it to my own data. This post is about the anti-pattern that caused this, how I fixed it, and how you can avoid it too.
The INDRA Lab has been collaborating with my old PhD cellmate Daniel Domingo-Fernández and his master’s student Helena Balabin on cross-modal transformers. We’ve jointly pre-trained a transformer (STonKGs) on knowledge graph embeddings from the INDRA database and the associated evidence text for each triple.
We then fine-tuned models for various downstream tasks, including one for rating the correctness of a given statement (relevant since most are from large-scale text mining). We’re very interested in comparing this to INDRA’s belief system that will be described in an upcoming publication from our group.
I wanted to apply this fine-tuned model to the self-updating models in EMMAA, but ran into some memory errors since the code was building up big lists, converting the lists into DataFrames using pandas, then exporting to disk as a TSV. The solution is to use generator functions (which create iterables for use with for loops) and write the results to a file directly. The basic anti-pattern looks like this:
import pandas as pd

def f(df: pd.DataFrame) -> pd.DataFrame:
    new_rows = []
    for _, row in df.iterrows():
        new_row = {
            ...  # somehow build up a new row, as a dictionary
        }
        new_rows.append(new_row)
    return pd.DataFrame(new_rows)
There are two problems with this:
- The input must be completely built up before calling f()
- The output must be completely built up before returning
This compounds for every function f(), g(), h(), and so on that takes in the result of the last DataFrame transformation, since they all have to keep everything in memory. In STonKGs, the first transformation is to look up the embeddings for a given source/target/evidence triple from both the knowledge graph embedding and the pre-trained BERT language model. The second transformation is to pre-process the embeddings. The third is to actually apply the STonKGs model to jointly embed them. The fourth is to apply the fine-tuned model. Each of these happens at varying speeds, but they are unfortunately decoupled.
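To make the compounding concrete, here’s a toy sketch (the function names and bodies are hypothetical stand-ins, not the real STonKGs code) in which each step has to finish building its entire output DataFrame before the next one can start:

import pandas as pd

# hypothetical stand-ins for two consecutive pipeline steps; each one
# materializes its whole output before returning
def lookup(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame([{"value": row["value"] + 1} for _, row in df.iterrows()])

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame([{"value": row["value"] * 2} for _, row in df.iterrows()])

df = pd.DataFrame({"value": range(1_000_000)})
# the input, the intermediate result, and the final output all sit in memory at once
result = preprocess(lookup(df))

If each intermediate result is kept in its own variable, every additional step adds another full copy of the data that has to stay in memory at the same time.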
Python has a really powerful tool for using ideas from functional programming in an approachable way that lets us solve this issue: the generator function. Rather than building up a huge list, we can simply yield each piece so another function can consume the pieces with a for loop. When I refactor code that looks like this, I make a second helper function that does the hard work, and try to maintain the original function’s interface by calling the helper function:
from typing import Any, Iterable, Mapping

import pandas as pd

def f(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame(f_helper(df))

def f_helper(df: pd.DataFrame) -> Iterable[Mapping[Any, Any]]:
    # new_rows = []  # we don't need this anymore!
    for _, row in df.iterrows():
        new_row = {
            ...  # somehow build up a new row, as a dictionary
        }
        yield new_row
If you’re not familiar with yield, here are two videos to get you thinking about how to use loops like a Pythonista (there’s also a tiny example of my own after the list):
- Loop like a native: while, for, iterators, generators (PyCon US 2013)
- Trey Hunner - Comprehensible Comprehensions (PyCon 2020)
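Here’s that tiny example, a toy of my own (not taken from either talk), just to show the shape of a generator function:

def count_up_to(n: int):
    """Yield the integers 1 through n one at a time, instead of building a list."""
    i = 1
    while i <= n:
        yield i  # execution pauses here until the caller asks for the next value
        i += 1

for value in count_up_to(3):
    print(value)  # prints 1, then 2, then 3

Calling count_up_to(3) doesn’t run the body at all; it returns a generator that produces one value each time the for loop asks for the next one.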
Now that we have f_helper, we’ve solved issue #2. The solution to issue #1 is to have functions that take in only the parts that are needed to build up each new row. This means you should accept an iterable, and have the high-level function slice up the DataFrame or do whatever pre-processing is necessary first:
from typing import Any, Iterable, Mapping

import pandas as pd

def f(df: pd.DataFrame) -> pd.DataFrame:
    it = (row for _, row in df.iterrows())
    return pd.DataFrame(f_helper(it))

def f_helper(rows: Iterable[Mapping[Any, Any]]) -> Iterable[Mapping[Any, Any]]:
    for row in rows:
        new_row = {
            ...  # somehow build up a new row, as a dictionary
        }
        yield new_row
With this refactoring, the code still does what it used to, but now you can think about how you might string together f_helper(), g_helper(), h_helper(), and so on directly, since their results don’t need to get completely materialized as lists. If you compose several functions that take in iterables and yield stuff (i.e., they return an iterable of the stuff that gets yielded), then you don’t have to worry about running out of memory, since only the results of one set of transformations through the composed functions need to exist at a time. In my case, I just printed the results to a file, and then they no longer needed to be in memory.
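For reference, here’s a minimal sketch of what that can look like end-to-end. The helpers, column names, and file name are made up for illustration; this is not the code from the PR:

import csv
from typing import Any, Iterable, Mapping, Sequence

Row = Mapping[str, Any]

# toy stand-ins for f_helper and g_helper: each consumes an iterable of rows
# and yields transformed rows one at a time
def f_helper(rows: Iterable[Row]) -> Iterable[Row]:
    for row in rows:
        yield {"x": row["x"], "f": row["x"] + 1}

def g_helper(rows: Iterable[Row]) -> Iterable[Row]:
    for row in rows:
        yield {**row, "g": row["f"] * 2}

def write_tsv(rows: Iterable[Row], path: str, fieldnames: Sequence[str]) -> None:
    with open(path, "w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        for row in rows:
            # each row is written (and can be garbage collected) as soon as it's produced
            writer.writerow(row)

source = ({"x": i} for i in range(1_000_000))  # a generator expression, never a list
write_tsv(g_helper(f_helper(source)), "results.tsv", fieldnames=["x", "f", "g"])

Because write_tsv() pulls rows lazily through the composed generators, only one row is in flight at any moment, no matter how many there are.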
This was a quick blog post I wrote while the code I updated was running. The full PR can be found here for reference. Feel free to chime in on that PR if you have questions about how I did this, or get in touch using any of the contact info at the bottom of my blog.