diff --git a/README.md b/README.md
index 8a2e1dbc..ee3ecd01 100644
--- a/README.md
+++ b/README.md
@@ -453,17 +453,21 @@ This tells Docker to use the
 
 We built Plz following these principles:
 
-- Code and data must be stored for future reference.
+- Data that isn't reproducible is worthless.
+- You don't know the value of your data at the time of creation.
 - Whatever part of the running environment can be captured by Plz, we capture
   it as to make jobs repeatable.
+- Hardware is expensive.
+- Code is a means to an end. What matters is the outcome you obtain from running
+  your code.
 - Functionality is based on standard mechanisms like files and environment
   variables. You don't need to add extra dependencies to your code or learn
   how to read/write your data in specific ways.
 - The tool must be flexible enough so that no unnecessary restrictions are
-  imposed by the architecture. You should be able to do with Plz whatever you
-  can do by running a program manually. It was surprising to find out how many
-  issues, mostly around running jobs in the cloud, could be solved only by
-  tweaking the configuration, without requiring any changes to the code.
+  imposed by its architecture. You should be able to do with Plz whatever you
+  can do by running a program manually. It was surprising to find out how much
+  of the friction around running jobs in the cloud could be removed just by
+  tweaking the configuration, without requiring any changes to Plz code.
 
 Plz is routinely used at `prodo.ai` to train ML models on AWS, some of them
 taking days to run in the most powerful instances available. We trust it to
@@ -471,6 +475,143 @@ start and terminate these instances as needed, and to manage our spot
 instances, allowing us to get a much better price than if we were using
 on-demand instances all the time.
 
+## How does Plz help
+
+Without Plz, the steps you'd need to run your code on an AWS instance would
+be:
+
+- go to the AWS console and start an instance (or create a launch template
+  beforehand and then use the CLI)
+- wait until the instance is up
+- get the IP address of the instance from the console
+- copy your code and data to the instance over ssh
+- ssh into the instance and run your job, preferably inside Docker so that a
+  dropped connection doesn't kill it (but if you want Docker you also have to
+  maintain a `Dockerfile` and build the image)
+- each time the connection drops or you turn off your computer, ssh in again;
+  if you didn't use Docker, you've lost your terminal and your job has very
+  likely died
+- watch your job until it finishes (and lose money whenever it has already
+  finished but the instance keeps running because you didn't check often
+  enough)
+- copy your results back to your machine over ssh, being disciplined about
+  where you store them and making sure you can link them to the (version of
+  the) code that produced them if you have several runs to compare; or, if
+  you started from a program that was running locally, change it to write to
+  a non-ephemeral location
+- if you care about your standard output/logs, gather and retrieve them
+  somehow
+- make a note of your results (like stats or accuracy), or copy the files
+  containing them
+
+All of that is reduced to `plz run`. If you stopped following the output of
+`plz run` (by hitting Ctrl-C, or by turning off your computer), you can run
+`plz output` to get the output at any time.
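+
+As a rough illustration of how much of the list above a single `plz run`
+replaces, here is what just the first few steps look like when scripted by
+hand with `boto3` (the AWS SDK for Python). This is only a sketch for
+comparison, not code that ships with Plz; the AMI id, key name, user and
+paths are made up:
+
+```python
+"""Illustration only: part of the manual workflow that `plz run` replaces."""
+import subprocess
+
+import boto3
+
+ec2 = boto3.client("ec2")
+
+# Start an instance (in the console this is several clicks plus a key pair).
+instance = ec2.run_instances(
+    ImageId="ami-0123456789abcdef0",  # hypothetical AMI
+    InstanceType="p3.2xlarge",
+    KeyName="my-key",                 # hypothetical key pair
+    MinCount=1,
+    MaxCount=1,
+)["Instances"][0]
+instance_id = instance["InstanceId"]
+
+# Wait until the instance is up, then get its IP address.
+ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
+description = ec2.describe_instances(InstanceIds=[instance_id])
+ip = description["Reservations"][0]["Instances"][0]["PublicIpAddress"]
+
+# Copy code and data, then run the job. This still leaves building a Docker
+# image, surviving dropped connections, collecting results and logs, and
+# remembering to terminate the instance.
+subprocess.run(["scp", "-r", ".", f"ubuntu@{ip}:job/"], check=True)
+subprocess.run(["ssh", f"ubuntu@{ip}", "cd job && python main.py"], check=True)
+```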
+
+If you want to rerun your job later (for instance, to try different
+parameters), you would need to have saved a copy of the code (or have been
+very disciplined with your git history, with tags or commits for every single
+one-line tweak you try; more about that
+[below](#why-is-plz-the-way-it-is)), and possibly also the same data you
+used. You'd need to retrieve the code from wherever you keep it (for
+instance, you may need to find the git branch and switch to it, possibly
+after creating another copy of the repo if you don't want to stop working on
+what you are doing).
+
+Another important benefit is that Plz gives you a standard way to run your
+code. Just as seeing a Makefile tells you that you can type `make`, seeing a
+`plz.config.json` tells you that you can do `plz run`. Your code can then be
+launched from whatever machine a teammate happens to be sitting at
+(especially if the job runs in the cloud). Teammates need to install `plz`,
+sure, but after a couple of installs your team will know how to do it, and
+that's one program to set up per team member instead of one setup per
+project.
+
+## Why is Plz the way it is
+
+This section is an attempt to describe the rationale behind the high-level
+architecture of Plz.
+
+- why Docker: it simplifies input and output, which leads to concrete
+  simplifications like log handling: we obtain a stream of logs from running
+  jobs just by calling the Docker API, with facilities to filter by time.
+  Running commands over ssh requires either keeping the connection open to
+  gather the output, or redirecting the output and reading it later from a
+  file. In general, Docker provides not only isolation, but also an
+  environment where the job runs autonomously with controlled inputs and
+  outputs
+- why not use git to store code snapshots (and to transfer code to the
+  instance): because it's very common for users to make changes that they
+  don't necessarily want in their commit history. For instance, when users
+  try to make their job run in the cloud, or to run it at a different scale
+  than they are used to (say, with far more data than they use locally), they
+  might try several one-line tweaks. These commits (possibly paired with
+  messages that will be meaningless in a month, like "Change foobar from 0
+  to 1") are hardly useful and pollute the repo history. Plz could also
+  create a different branch for each job, but (in order to allow for
+  `plz rerun`) those branches would have to be kept, would be listed in
+  `git branch`, and so on. _A good summary answer to the question would be:
+  users want to commit things that "work" (commits you can revert to, use for
+  reference, etc.), and you don't know whether something works until you've
+  run it._ The solution for code storage we implemented, based on Docker
+  images, is quite simple to implement and to understand, as the Docker API
+  lets you create an image just by sending the files as a tarball (see the
+  sketch below). If we were using git then, for private repos, we would need
+  to handle git credentials on the instance, which would actually be more
+  complicated than using Docker. Docker images are given a name so that they
+  can be referenced later, which also makes `plz rerun` easy to implement.
+  The code can be retrieved by looking inside the image, which is a reliable
+  source of truth, as it stores the code that was actually run
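+
+Below is a minimal sketch of that image-from-tarball idea using the Docker
+SDK for Python. It illustrates the mechanism rather than Plz's actual
+implementation; the base image, paths and tag handling are made up:
+
+```python
+"""Illustration: snapshot code by sending it to Docker as a build context."""
+import io
+import tarfile
+
+import docker  # Docker SDK for Python
+
+# A made-up Dockerfile; Plz builds on whatever image the user configures.
+DOCKERFILE = b"FROM python:3.10-slim\nCOPY src /src\nWORKDIR /src\n"
+
+
+def snapshot(code_dir: str, tag: str) -> None:
+    # Pack a Dockerfile plus the user's code into an in-memory tarball.
+    context = io.BytesIO()
+    with tarfile.open(fileobj=context, mode="w") as tar:
+        dockerfile = tarfile.TarInfo(name="Dockerfile")
+        dockerfile.size = len(DOCKERFILE)
+        tar.addfile(dockerfile, io.BytesIO(DOCKERFILE))
+        tar.add(code_dir, arcname="src")
+    context.seek(0)
+
+    # custom_context tells the daemon that the file object is a full build
+    # context (a tarball), not just a Dockerfile.
+    client = docker.from_env()
+    client.images.build(fileobj=context, custom_context=True, tag=tag)
+```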
+
+### Could Plz be smaller?
+
+- why do we need a controller/server: one reason is to manage locks (for
+  instance, to avoid two job requests using the same instance). It's true
+  that locking could be done with just a Redis server (so instead of a
+  controller/Plz server, the CLI could perhaps point to a Redis server that
+  takes care of locks), but that would force the tool to assume that everyone
+  uses it collaboratively (one could engineer an altered CLI that locks every
+  instance, for example). We make that assumption now, but we are not forced
+  to keep it in the future. Another reason for a controller is a feature we
+  have considered for a while: rerunning jobs whose spot instances were
+  terminated because they were overbid. For that, we need something running
+  permanently in the cloud, as there might not be a CLI running at the moment
+  the spot instance is terminated. In general, if you want to do anything
+  serious with a bunch of permanently running instances, you'll eventually
+  need a coordinator/controller. Even if the current features might not
+  strictly require a controller, it's good that the features that do require
+  one won't need a major refactor. Needless to say, a controller-less Plz
+  cannot be obtained by just deleting the controller: a major effort would be
+  needed so that the tasks the controller performs (setting inputs,
+  collecting outputs, etc.) are done by, for instance, a wrapper around the
+  program being run by the user
+- why collect information from the running program: while it would be
+  possible to leave it to user programs to write to whatever non-ephemeral
+  storage they choose, that would burden the Plz user with significantly
+  changing a program that they already run locally (for instance, using the
+  AWS API to write to S3 instead of writing local files). With the current
+  Plz mechanism, as long as there is a single point in your program where you
+  can set the output directory (and if your program doesn't have such a
+  point, it's a good idea to implement it anyway), you can just write files
+  and Plz will make them non-ephemeral for you (see the sketch after this
+  list). Also, with the current mechanism, team members know how to access
+  the outcomes of your job even if they don't know the details (`plz output`
+  for "blobs" and `plz measures` for structured outputs), and can read them
+  using standard tools, as every computer setup can process JSON and plain
+  files (as opposed to, say, running SQL queries in the cloud)
+- why manage the instances ourselves / why not use Kubernetes: because
+  autoscaling mechanisms (whether Kubernetes or autoscaling groups) don't
+  cover the case of "interactive users", who want to see instances spawn when
+  they launch a job and see them terminate when they stop it. Autoscaling
+  mechanisms specify cooldown periods so that scaling changes don't happen
+  all the time and degrade performance, but these make operations
+  non-immediate and non-deterministic, which can be really annoying when
+  working interactively. We learned all of this because our first attempt
+  used AWS autoscaling groups, and that version of Plz was a pain to use and
+  to test manually ("did AWS get that we want to terminate this instance?
+  Let's wait; sometimes it takes 5 minutes to take it down"). As for
+  Kubernetes specifically, when we started Plz, the managed Kubernetes
+  offering on AWS (EKS) wasn't available yet. There is a Kubernetes feature
+  in the works: the plan is that users will be able to either specify a
+  Kubernetes cluster to which the execution will be sent (to support the
+  non-interactive case), or, as we currently do, specify an instance type, so
+  that an instance is started and managed by Plz (to support the interactive
+  case)
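+
+As an illustration of the output mechanism described in the list above, this
+is roughly what a user program looks like. The environment variable name and
+file names here are assumptions made for the example (check the Plz
+documentation for the exact conventions); the point is that the only
+Plz-specific part is reading one environment variable:
+
+```python
+"""Illustration: a job whose only 'integration' with Plz is one env var."""
+import json
+import os
+
+
+def train_model() -> float:
+    # Stand-in for the user's real training code.
+    return 0.9
+
+
+# Assumed for this example: the output location arrives in an environment
+# variable; when running locally we simply fall back to a directory of ours.
+output_directory = os.environ.get("OUTPUT_DIRECTORY", "output")
+os.makedirs(output_directory, exist_ok=True)
+
+accuracy = train_model()
+
+# A plain file ("blob") that can later be fetched with `plz output`.
+with open(os.path.join(output_directory, "summary.txt"), "w") as f:
+    f.write(f"accuracy: {accuracy}\n")
+
+# Structured results are just JSON, readable with standard tools; the exact
+# file Plz expects for `plz measures` may differ from this sketch.
+with open(os.path.join(output_directory, "measures.json"), "w") as f:
+    json.dump({"accuracy": accuracy}, f)
+```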
+
 ## Future work
 
 In the future, Plz is intended to: