Conversation

@shayan74 commented Nov 7, 2025

Dear Jadi,

Thank you for creating such a wonderful machine learning course — I’ve been recommending it to Persian-speaking students who are eager to learn ML.

While reviewing the code, I noticed a small detail in the train/test split logic that might cause slight variations in the ratio. The current approach:

np.random.rand(len(df)) < 0.8

works well in general, but because each row is drawn at random, the realized training fraction can land anywhere between roughly 77% and 82%. This is perfectly acceptable for large datasets, but on smaller datasets it can lead to noticeable deviations and potential confusion for learners.
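
For illustration, here is a quick sketch (using a hypothetical 50-row DataFrame, not the course's actual data) showing how much the realized training fraction drifts from run to run:

import numpy as np
import pandas as pd

# Hypothetical small dataset, just to show the drift on small data.
df = pd.DataFrame({"x": range(50)})

for _ in range(5):
    msk = np.random.rand(len(df)) < 0.8
    print(msk.mean())  # realized training fraction, e.g. values like 0.74 or 0.84 rather than exactly 0.80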

To make the ratio more consistent, I suggest using:

def random_boolean_array(x, true_ratio=0.8):
    """Return a shuffled boolean array of length x with an exact share of True values."""
    n_true = int(x * true_ratio)          # exact number of training rows
    n_false = x - n_true                  # remaining rows go to the test set
    arr = np.array([True] * n_true + [False] * n_false)
    np.random.shuffle(arr)                # randomize which rows get which label
    return arr

This approach produces an exact 80/20 split every time (up to integer rounding).
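
As a rough sketch of how it would slot into the existing notebook code (the df, train, and test names are assumed from the course material):

msk = random_boolean_array(len(df))
train = df[msk]    # exactly int(0.8 * len(df)) rows
test = df[~msk]    # the remaining rows
print(len(train) / len(df))  # 0.80, up to integer rounding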

Alternatively, to keep the inline coding style (and avoid a separate splitting function), the split probabilities can be made explicit with:

np.random.choice([True, False], size=len(df), p=[0.8, 0.2])
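
For completeness, a minimal sketch of how this variant would slot into the same split (again assuming the notebook's df, train, and test names):

msk = np.random.choice([True, False], size=len(df), p=[0.8, 0.2])
train = df[msk]    # each row is still an independent draw, so the ratio remains approximate
test = df[~msk]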

Thank you for your time and for the excellent educational content you share.

Damet garm (thanks a lot)!
Shayan
