Commit b4e08bd

Refactor: third phase of RESTRUCTURE.md
1 parent fc5ef85 · commit b4e08bd


45 files changed: 132 additions, 68 deletions

.readthedocs.yml

Lines changed: 1 addition & 1 deletion

@@ -21,4 +21,4 @@ sphinx:
 # See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
 python:
   install:
-    - requirements: requirements_docs.txt
+    - requirements: dependencies/requirements/requirements_docs.txt

PREFLIGHT.md

Lines changed: 1 addition & 1 deletion

@@ -26,7 +26,7 @@ bash preflight.sh PLATFORM=GCE && numactl --membind 0 --cpunodebind=0 python3 -m
 ```
 
 For GKE,
-`numactl` should be built into your docker image from [maxtext_dependencies.Dockerfile](https://github.com/google/maxtext/blob/main/maxtext_dependencies.Dockerfile), so you can use it directly if you built the maxtext docker image. Here is an example
+`numactl` should be built into your docker image from [maxtext_dependencies.Dockerfile](https://github.com/google/maxtext/blob/main/dependencies/dockerfiles/maxtext_dependencies.Dockerfile), so you can use it directly if you built the maxtext docker image. Here is an example
 
 ```
 bash preflight.sh PLATFORM=GKE && numactl --membind 0 --cpunodebind=0 python3 -m MaxText.train src/MaxText/configs/base.yml run_name=$YOUR_JOB_NAME

dependencies/dockerfiles/maxtext_db_dependencies.Dockerfile

Lines changed: 3 additions & 3 deletions

@@ -40,9 +40,9 @@ ENV MAXTEXT_REPO_ROOT=/deps
 WORKDIR /deps
 
 # Copy setup files and dependency files separately for better caching
-COPY tools/setup /deps/tools/setup/
-COPY dependencies/requirements/ /deps/dependencies/requirements/
-COPY src/install_maxtext_extra_deps/extra_deps_from_github.txt /deps/dependencies/requirements/
+COPY tools/setup tools/setup/
+COPY dependencies/requirements/ dependencies/requirements/
+COPY src/install_maxtext_extra_deps/extra_deps_from_github.txt src/install_maxtext_extra_deps/
 
 # Install dependencies - these steps are cached unless the copied files change
 RUN echo "Running command: bash setup.sh MODE=$ENV_MODE JAX_VERSION=$ENV_JAX_VERSION LIBTPU_GCS_PATH=${ENV_LIBTPU_GCS_PATH} DEVICE=${ENV_DEVICE}"
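
The same three-line COPY change recurs in the Dockerfiles below. Because `WORKDIR /deps` is in effect, the new relative destinations resolve to the same `/deps/...` locations as before; the destination for `extra_deps_from_github.txt` also now mirrors its source path under `src/install_maxtext_extra_deps/` rather than `dependencies/requirements/`. A sketch of a build invocation from the repo root, assuming a plain `docker build` workflow (the command and tag are illustrative, not part of this commit):

```sh
# Build from the repo root so the build context contains everything the
# COPY steps reference: tools/setup, dependencies/requirements/, and
# src/install_maxtext_extra_deps/extra_deps_from_github.txt.
docker build \
  -f dependencies/dockerfiles/maxtext_db_dependencies.Dockerfile \
  -t maxtext-deps:latest .
```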

dependencies/dockerfiles/maxtext_dependencies.Dockerfile

Lines changed: 3 additions & 3 deletions

@@ -40,9 +40,9 @@ ENV MAXTEXT_REPO_ROOT=/deps
 WORKDIR /deps
 
 # Copy setup files and dependency files separately for better caching
-COPY tools/setup /deps/tools/setup/
-COPY dependencies/requirements/ /deps/dependencies/requirements/
-COPY src/install_maxtext_extra_deps/extra_deps_from_github.txt /deps/dependencies/requirements/
+COPY tools/setup tools/setup/
+COPY dependencies/requirements/ dependencies/requirements/
+COPY src/install_maxtext_extra_deps/extra_deps_from_github.txt src/install_maxtext_extra_deps/
 
 # Install dependencies - these steps are cached unless the copied files change
 RUN echo "Running command: bash setup.sh MODE=$ENV_MODE JAX_VERSION=$ENV_JAX_VERSION LIBTPU_GCS_PATH=${ENV_LIBTPU_GCS_PATH} DEVICE=${ENV_DEVICE}"

dependencies/dockerfiles/maxtext_gpu_dependencies.Dockerfile

Lines changed: 3 additions & 3 deletions

@@ -42,9 +42,9 @@ ENV MAXTEXT_REPO_ROOT=/deps
 WORKDIR /deps
 
 # Copy setup files and dependency files separately for better caching
-COPY tools/setup /deps/tools/setup/
-COPY dependencies/requirements/ /deps/dependencies/requirements/
-COPY src/install_maxtext_extra_deps/extra_deps_from_github.txt /deps/dependencies/requirements/
+COPY tools/setup tools/setup/
+COPY dependencies/requirements/ dependencies/requirements/
+COPY src/install_maxtext_extra_deps/extra_deps_from_github.txt src/install_maxtext_extra_deps/
 
 # Install dependencies - these steps are cached unless the copied files change
 RUN echo "Running command: bash setup.sh MODE=$ENV_MODE JAX_VERSION=$ENV_JAX_VERSION DEVICE=${ENV_DEVICE}"

dependencies/dockerfiles/maxtext_jax_ai_image.Dockerfile

Lines changed: 3 additions & 3 deletions

@@ -16,9 +16,9 @@ ENV MAXTEXT_REPO_ROOT=/deps
 WORKDIR /deps
 
 # Copy setup files and dependency files separately for better caching
-COPY tools/setup /deps/tools/setup/
-COPY dependencies/requirements/ /deps/dependencies/requirements/
-COPY src/install_maxtext_extra_deps/extra_deps_from_github.txt /deps/dependencies/requirements/
+COPY tools/setup tools/setup/
+COPY dependencies/requirements/ dependencies/requirements/
+COPY src/install_maxtext_extra_deps/extra_deps_from_github.txt src/install_maxtext_extra_deps/
 
 # For JAX AI tpu training images 0.4.37 AND 0.4.35
 # Orbax checkpoint installs the latest version of JAX,

docs/development.md

Lines changed: 1 addition & 1 deletion

@@ -12,7 +12,7 @@ If you are writing documentation for MaxText, you may want to preview the docume
 First, make sure you install the necessary dependencies. You can do this by navigating to your local clone of the MaxText repo and running:
 
 ```bash
-pip install -r requirements_docs.txt
+pip install -r dependencies/requirements/requirements_docs.txt
 ```
 
 Once the dependencies are installed, you can navigate to the `docs/` folder and run:
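
Only the requirements path changes in this hunk; the build step itself is outside the diff. For orientation, a minimal preview flow under the new layout might look like the sketch below, assuming a standard Sphinx Makefile target (an assumption on our part; see docs/development.md for the actual command):

```sh
# Install docs dependencies from their new location, then build a local
# HTML preview with Sphinx.
pip install -r dependencies/requirements/requirements_docs.txt
cd docs
make html  # assumed Sphinx target; the repo's real docs command may differ
```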

docs/guides/data_input_pipeline/data_input_grain.md

Lines changed: 6 additions & 6 deletions

@@ -29,17 +29,17 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state
 
 ## Using Grain
 1. Grain currently supports two data formats: [ArrayRecord](https://github.com/google/array_record) (random access) and [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources.html) class.
-2. When the dataset is hosted on a Cloud Storage bucket, Grain can read it through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/setup_gcsfuse.sh). The script configures some parameters for the mount.
-```
-bash setup_gcsfuse.sh \
+2. When the dataset is hosted on a Cloud Storage bucket, Grain can read it through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh). The script configures some parameters for the mount.
+```sh
+bash tools/setup/setup_gcsfuse.sh \
 DATASET_GCS_BUCKET=$BUCKET_NAME \
 MOUNT_PATH=$MOUNT_PATH \
 [FILE_PATH=$MOUNT_PATH/my_dataset]
 # FILE_PATH is optional, when provided, the script runs "ls -R" for pre-filling the metadata cache
 # https://cloud.google.com/storage/docs/cloud-storage-fuse/performance#improve-first-time-reads
 ```
 3. Set `dataset_type=grain`, `grain_file_type={arrayrecord|parquet}`, `grain_train_files` to match the file pattern on the mounted local path.
-4. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scenes](https://google-grain.readthedocs.io/en/latest/behind_the_scenes.html), [grain_pool.py](https://github.com/google/grain/blob/main/grain/_src/python/grain_pool.py)). If you use a large number of workers, check your config for gcsfuse in [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/setup_gcsfuse.sh) to avoid gcsfuse throttling.
+4. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scenes](https://google-grain.readthedocs.io/en/latest/behind_the_scenes.html), [grain_pool.py](https://github.com/google/grain/blob/main/grain/_src/python/grain_pool.py)). If you use a large number of workers, check your config for gcsfuse in [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh) to avoid gcsfuse throttling.
 
 5. For multi-source blending, you can specify multiple data sources with their respective weights using semicolon (;) as a separator and colon (:) for weights. The weights will be automatically normalized to sum to 1.0. For example:
 ```

@@ -52,8 +52,8 @@ grain_train_files=/tmp/gcsfuse/dataset1.array_record*:1;/tmp/gcsfuse/dataset2.ar
 Note: When using multiple data sources, only the ArrayRecord format is supported.
 
 6. Example command:
-```
-bash setup_gcsfuse.sh \
+```sh
+bash tools/setup/setup_gcsfuse.sh \
 DATASET_GCS_BUCKET=maxtext-dataset \
 MOUNT_PATH=/tmp/gcsfuse && \
 python3 -m MaxText.train src/MaxText/configs/base.yml \
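
Aside from the relocated script paths, the blending syntax described in step 5 is unchanged by this commit: sources are separated by `;` and weighted with `:`, and the weights are normalized to sum to 1.0. A small illustration with hypothetical dataset paths (the 3:7 split is arbitrary):

```sh
# Two ArrayRecord sources blended 30% / 70% after weight normalization.
grain_train_files='/tmp/gcsfuse/wiki.array_record*:3;/tmp/gcsfuse/c4.array_record*:7'
```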

docs/guides/data_input_pipeline/data_input_tfds.md

Lines changed: 2 additions & 2 deletions

@@ -1,8 +1,8 @@
 # TFDS pipeline
 
 1. Download the Allenai C4 dataset in TFRecord format to a Cloud Storage bucket. For information about cost, see [this discussion](https://github.com/allenai/allennlp/discussions/5056)
-```
-bash download_dataset.sh {GCS_PROJECT} {GCS_BUCKET_NAME}
+```sh
+bash tools/data_generation/download_dataset.sh ${GCS_PROJECT} ${GCS_BUCKET_NAME}
 ```
 2. In `src/MaxText/configs/base.yml` or through command line, set the following parameters:
 ```yaml
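
Note that the command now uses explicit shell parameter expansion (`${GCS_PROJECT}` instead of `{GCS_PROJECT}`), so the variables must be set before invoking the script. A hypothetical invocation (project and bucket names are placeholders):

```sh
export GCS_PROJECT=my-gcp-project    # placeholder project ID
export GCS_BUCKET_NAME=my-c4-bucket  # placeholder bucket name
bash tools/data_generation/download_dataset.sh ${GCS_PROJECT} ${GCS_BUCKET_NAME}
```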

docs/guides/knowledge_distillation.md

Lines changed: 3 additions & 2 deletions

@@ -47,12 +47,13 @@ export RUN_NAME = <unique name for the run>
 
 #### b. Install dependencies
 
-```
+```sh
 git clone https://github.com/AI-Hypercomputer/maxtext.git
 python3 -m venv ~/venv-maxtext
 source ~/venv-maxtext/bin/activate
+python3 -m pip install uv
 cd maxtext
-uv pip install -r requirements.txt
+uv pip install -r dependencies/requirements/requirements.txt
 ```
 
 ### 1. Obtain and prepare the teacher model
