One of the cardinal sins in computational science is to hard code a file path in your analysis. This post is a guide to reorganizing your code to avoid this and then to generate a command line interface (CLI) using click.
The best way around this is to make all of your code live inside a function that takes a file path as an argument. Here’s an example of some sinful code:
# sinful_analysis.py import pandas as pd df = pd.read_csv('/Users/cthoyt/data/example.tsv') analysis = do_analysis(df) save_analysis(analysis, '/Users/cthoyt/data/analysis.tsv')
Here’s the same code, but enlightened:
# enlightened_analysis.py import pandas def do_enlightened_analysis(input_path, output_path): df = pd.read_csv(input_path) analysis = do_analysis(df) save_analysis(analysis, output_path)
The enlightened code doesn’t contain any references to the file paths on which you’re doing analysis. In fact, the enlightened code can’t even be run directly without passing the file paths as variables. This pattern gets you in the mindset of separating the code from the configuration for running the code. Again, this is important because the file path will change depending on who’s running it, if you decide to do spring cleaning on your hard drive, or if you get new files.
There are lots of ways you might pass the input and output paths into this function. The
most obvious, since you’ve probably read my previous blog post
and you’re now a packaging master, is to
enlightened_analysis and run it from the Python REPL. Another way would be to
make a one-off Python script whose job is to actually run the analysis (as opposed to this
example, which is creating the workflow to be run). Though this are both better than the
sinful analysis, it’s a problem since you have to manually interact with Python to run your code.
You’re likely familiar with using the CLI for
Wouldn’t it be terrible if you had to write a Python script
pip (like R makes you do with
install.packages(), ughhh!!). This is the
same visceral reaction you should have to having to make specific python code for an analysis.
Making your first CLI
As you might have guessed, the solution is to make a CLI. After making your function that does the hard work, the job of the CLI should just take care of getting the configuration (e.g., file paths) from the user and passing them to your functions that do the hard work.
CLIs should be very very short! If you’re putting lots of logic inside your CLI, then you should probably reconsider refactoring that logic into more generally reusable functions. Here’s an example of the hello world CLI:
# cli_simple.py if __name__ == '__main__': print('hello')
Then you run this python script with
python cli_simple.py. It’s boring. You didn’t need
to read this guide to do this. However, you might not know what
if __name__ == '__main__' is.
It turns out that all python files know their name and store it in the
__name__ variable when
they’re imported. However, if you are running a python file as a script then
set to the string
'__main__'. This allows you to make sure that the
print('hello') is only
ever run if the user is actually running the script as a CLI.
However, don’t be tempted to put lots of code in
if __name__ == '__main__'. You should always
make a function
main() where all of the code that runs the CLI, and just call it.
# cli_simple_2.py def main(): print('hello') if __name__ == '__main__': main()
You still run this script with
Another reason we introduced
main() is to take care of getting information from the user. That’s
click comes in. It makes functions do all sorts of magical things to get information
from command line arguments. All you have to do is use function decorator (something
starting with the
@ symbol) to annotate that the function is a
click.command(). If you’re not
already familiar with decorators, check this short video
or this long video.
pip install click in your shell then update your code to look like this:
# cli_simple_3.py import click @click.command() def main(): print('hello') if __name__ == '__main__': main()
You still run this script with
python cli_simple_3.py and it does exactly the same as the last one.
However, once your main function is a
click.command(), you can do all sorts of wonderful things.
The first is to pass arguments from the command line into the function. Update your code to look like
# cli_simple_4.py import click @click.command() @click.argument('text') def main(text): print(text) if __name__ == '__main__': main()
You probably notice that the call to
main() at the bottom does not include an argument
text. That’s because
click is decorating the original function in the meantime, which means
that the actual thing called
main is a function that takes no arguments in the end. This
is good, because click uses the extra decorators (e.g.
to figure out what to put in the arguments
from the function that we actually wrote. Notice that the
matches up to the variable name. That’s no coincidence.
You can now run this script from the command line with
python cli_simple_4.py. You’ll
see that it yells at you for forgetting the
text argument. Better not! Try again
python cli_simple_4.py "Hello World!" and you’ll be happy to see you’re now at
Hello World for CLIs. From here you can do all sorts of stuff which is all outlined in
click also automatically generates documentation for you, so it’s always possible to run
the command without arguments and with the
--help flag as in
python cli_simple_4.py --help.
It will give you information about all of the arguments, their types, and more.
CLIs in Package World
It’s my strong opinion that almost all code should be packaged, and the CLI is no exception.
To finish our original problem, we’ll create a python file
cli.py in the package
enlightened_analysis.py is and import our function from there. Then we’ll add
the right arguments to
click, pass them to the right place, and profit!
# cli.py import click from .enlightened_analysis import do_enlightened_analysis @click.command() @click.argument('input_path') @click.argument('output_path') def main(input_path, output_path): do_enlightened_analysis(input_path, output_path) if __name__ == '__main__': main()
If you’ve done it right, a the body of the
main() function for your CLI should very boring function.
Of course, there are other ways to organize your code, but this is a good
way to do it until you’re more comfortable.
However, now we’re living in package world. In this tutorial, I’ve skipped the explanation of
cli.py into a package. I’ll assume from here that
you’ve done this and named the package
superanalysis. If you’re not familiar with doing that,
check my previous blog post or
my tutorial on YouTube.
We don’t want to interact with this code by running it as a script with
python cli.py. Instead, we want to
interact with the code via the package. Further, if you
cd into the place where the code is and
python cli.py, you’ll get an import warning because relative imports don’t work
when you’re not in a Python package context. This error is a good thing - it’s a reminder that you
should always live in the packaged world.
The solution to the problem is to use the
-m flag in the
python CLI. Remember that
cli.py modules are in a package called
you should have also already installed).
You can now run the CLI using
python -m superanalysis.cli <input_path> <output_path>. This is
also going to set
'__main__' the same way as running it as a script, but you’re
in the python package context!
Vanity is a Virtue
-m can almost be used to run any python file inside your package as command line interface,
which means you should always wrap up code for the CLI in
if __name__ == '__main__' so it doesn’t
accidentally get run if the module is imported.
The exception is the
__init__.py files can’t be run as a module. If you were to write
python -m superanalysis, it wouldn’t run the
__init__.py file as a script and instead
would throw an error.
If you want to associate a CLI with the package , you need to make an additional file
__main__.py sitting next to
cli.py in the
superanalysis package. As an aside,
this also works in subpackages.
# __main__.py """Entrypoint module, in case you use `python -m superanalysis`. Why does this file exist, and why `__main__`? For more info, read: - https://www.python.org/dev/peps/pep-0338/ - https://docs.python.org/3/using/cmdline.html#cmdoption-m """ from .cli import main if __name__ == '__main__': main()
This python module simply reuses the main function we already wrote before. It can basically be copied verbatim from package to package, but don’t forget to change the first line to match yours! I like to copy it because it also has the information from the python docs on why it works.
Now, you can run
python -m superanalysis instead of
python -m superanalysis.cli.
We can also do one better. Wouldn’t it be nice to make a CLI function so we could just run
superanalysis <input_path> <output_path>? You’re in luck, because since we grouped all
of our code in a
main() function, we can make a small addition to the
entry points to tell
pip to automatically create a
superanalysis CLI in your shell
by doing the following:
[options.entry_points] console_scripts = superanalysis = superanalysis.cli:main
The left part is the name of the CLI that will be in your shell, then the right part has
the path to a module followed by a colon
: then the name of the function to be run.
That’s pretty much it! Now you can make beautiful command line interfaces. There’s one more topic that I think is worth noting at the end of this tutorial, and that’s to use command line groups. This allows you to organize subcommands in your CLI and import other CLIs from other modules. It would look like this:
# cli.py import click @click.group() def main(): pass # becuase this is a group, you don't actually need to do anything in it @main.command() # note that the main function now can assign commands def subcommand1(): print('hello world') @main.command() def subcommand2(): print('other greeting') @main.group() def subgroup(): pass @subgroup.command() def turtle(): print('you can go as deep as you want with subcommands') # You can include other CLIs from other modules in your package # to make everything much more unified from .my_other_module.cli import main as other_module_command main.add_command(other_module_command)
You can see that rather than using the
click.command() decorator, the
click.group() decorator. This means that it can be used to issue subcommands or
even subgroups! At the end, it was also used to combine CLIs from another part of the
same package. This is good if you have a package that does lots of things, but you
want a single unified CLI to access all of it. Just be careful with how the functions
are named (it doesn’t always have to be main) becuase if two are called the same thing
then there will be a name clash and one sub-command or sub-group won’t get shown.
CLIs are really powerful! In the end, you should can write in your README how you used your CLI on your data to run your experiment. This gives others the best shot at reproducing your work. Happy hunting, and see you the next installment of “How to Code with Me”!