Grokking is a phenomenon observed in neural nets where, after an initial phase of overfitting (or memorization), the model suddenly achieves perfect generalization (Power et al., 2022). Inspired by that work, we incorporate some modern Transformer tricks (e.g., RoPE, RMSNorm, SiLU) and achieve grokking in < 150 epochs on modular division.
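For a rough sense of what those tricks look like in MLX, here is a minimal sketch of the components only (not the repo's actual `models.py`; dimensions and module wiring here are illustrative assumptions):

```python
# Sketch of the "modern tricks" as MLX modules -- illustrative, not the repo's models.py.
import mlx.core as mx
import mlx.nn as nn

dims, num_heads, seq_len = 128, 4, 6
head_dim = dims // num_heads

x = mx.random.normal((1, seq_len, dims))   # dummy token embeddings

norm = nn.RMSNorm(dims)                    # RMSNorm in place of LayerNorm
rope = nn.RoPE(head_dim)                   # rotary position embeddings, applied per head
up, down = nn.Linear(dims, 4 * dims), nn.Linear(4 * dims, dims)

h = norm(x)

# RoPE rotates queries/keys after reshaping to (batch, heads, seq, head_dim).
q = h.reshape(1, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)
q_rotated = rope(q)

# SiLU (swish) activation in the position-wise feed-forward path.
y = down(nn.silu(up(h)))
print(q_rotated.shape, y.shape)            # (1, 4, 6, 32) (1, 6, 128)
```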
We define modular arithmetic for the following binary operations, given a prime modulus $p$:

- Addition: $a \circ b = a + b \mod p$
- Subtraction: $a \circ b = a - b \mod p$
- Multiplication: $a \circ b = a \cdot b \mod p$
- Division: $a \circ b = a / b \mod p$, computed via Fermat's Little Theorem, which states that $b^{p-1} \equiv 1 \mod p$ for any $b$ not divisible by $p$, so $a / b \equiv a \cdot b^{p-2} \mod p$ (see the sketch below).
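As a sketch of how the division equations can be enumerated (illustrative only; the function name and the choice of $p$ below are not taken from `data.py`), dividing by $b$ amounts to multiplying by the modular inverse $b^{p-2} \mod p$:

```python
# Sketch: enumerate all (a, b, a / b mod p) triples for the division task.
# Illustrative only -- not necessarily how data.py builds the dataset.
def make_division_triples(p: int):
    """Yield (a, b, c) such that c = a / b mod p, i.e., a = b * c mod p."""
    for a in range(p):
        for b in range(1, p):             # skip b = 0 so the inverse exists
            b_inv = pow(b, p - 2, p)      # Fermat: b^(p-2) is the inverse of b mod p
            yield a, b, (a * b_inv) % p


triples = list(make_division_triples(97))  # p = 97 is a common choice in the grokking literature
print(len(triples))                        # 97 * 96 = 9312 equations
a, b, c = triples[123]
assert (b * c) % 97 == a                   # sanity check: c really equals a / b mod 97
```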
Run with default params to reproduce `media/grokking.png`:

```shell
python main.py
```

- `main.py`: training and evaluation loops
- `models.py`: defines the Transformer model
- `data.py`: generates the dataset (see the split sketch below)
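In the standard grokking setup (Power et al., 2022), the model is trained on only a fraction of all possible equations and evaluated on the held-out rest. A rough sketch of such a split follows; the 50% fraction and the function name are assumptions here, not read from `data.py`:

```python
# Sketch of a train/validation split over all equations (assumed setup, not data.py's exact code).
import random


def split_equations(triples, train_frac: float = 0.5, seed: int = 0):
    """Shuffle all (a, b, result) equations and hold out a fraction for validation."""
    triples = list(triples)
    random.Random(seed).shuffle(triples)
    n_train = int(train_frac * len(triples))
    return triples[:n_train], triples[n_train:]


# e.g. with the division triples from the earlier sketch:
# train, val = split_equations(make_division_triples(97), train_frac=0.5)
```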
Install the dependencies (optimized for Apple silicon; yay for MLX!):
```shell
pip install -r requirements.txt
```
