
Commit 87fa1be

master: Adding NLP folder.
1 parent 7362b09 commit 87fa1be


78 files changed, +334851 -0 lines changed

# Week 1: The NLP Pipeline
# Code
Notebook : [NLP Pipeline](https://github.com/purvasingh96/Natural-Language-Specialization/blob/master/Week-1/text_processing.ipynb)
# Summary
## Cleaning
In this step, we perform the following tasks:

1. Get the text (`requests.get(url).text`)
2. Remove HTML tags 🏷 using `BeautifulSoup`.

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page from step 1
r = requests.get(url)
soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())
```
3. Perform web scraping 🕷

```python
# summaries is the list of story elements parsed from soup in the notebook
# Extract the title of the first story
summaries[0].find("a", class_="storylink").get_text().strip()
```

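Putting the parsing and extraction together in a self-contained sketch (the HTML snippet and the `tr.athing` / `a.storylink` selectors here are illustrative stand-ins for whatever page the notebook actually scrapes):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fetched news page
html = '''
<tr class="athing">
  <td><a class="storylink" href="#"> Example story title </a></td>
</tr>
'''
soup = BeautifulSoup(html, "html.parser")

# Collect the story rows, then pull out the first title
summaries = soup.find_all("tr", class_="athing")
title = summaries[0].find("a", class_="storylink").get_text().strip()
print(title)  # Example story title
```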
## Normalization
### Case Normalization
Convert all text to lower case: `text = text.lower()`.

### Punctuation Removal
Remove all punctuation marks.
```python
import re

# Replace everything that is not a letter or digit with a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
```

## Tokenization
41+
42+
### Split the text
Tokenize all the words in a text, or tokenize the text at the sentence level.

```python
# Requires: nltk.download("punkt")
from nltk.tokenize import word_tokenize, sent_tokenize

# Split text into words using NLTK
words = word_tokenize(text)

# Split text into sentences
sentences = sent_tokenize(text)
```

### Remove stop-words
58+
Stop words include words such as *'i', 'me', 'my', 'myself', 'we', 'our', 'ours', etc.*, which increase our vocab size unnecessarily. We remove them as follows:

```python
# Requires: nltk.download("stopwords")
from nltk.corpus import stopwords

# Remove stop words
words = [w for w in words if w not in stopwords.words("english")]
```

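The same filtering pattern with a tiny hard-coded stop-word set (a small subset of NLTK's English list), so it runs without downloading the corpus:

```python
# Tiny illustrative subset of NLTK's English stop words
stop_words = {"i", "me", "my", "the", "a", "is"}

words = ["i", "like", "the", "movie"]
filtered = [w for w in words if w not in stop_words]
print(filtered)  # ['like', 'movie']
```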
## Stemming / Lemmatization

Stemming reduces a word to its *stem*, while lemmatization reduces it to its *root* (lemma). The difference between the two processes is that stemming may not generate a meaningful word, whereas the root word generated by lemmatization is always meaningful.

```python
# Requires: nltk.download("wordnet")
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmed = [PorterStemmer().stem(w) for w in words]
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
```

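To see why stems are not always dictionary words, run `PorterStemmer` on a few inflected forms (assuming NLTK is installed; the stemmer itself needs no corpus download):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["studies", "caring"]:
    print(word, "->", stemmer.stem(word))
# 'studies' -> 'studi' (not a dictionary word), whereas a
# lemmatizer would return the real word 'study'
```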
## Final NLP Pipeline
The final pipeline for text pre-processing looks as follows:

<img src="./images/NLP Pipeline.png" height="300">
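A minimal, dependency-free sketch of the whole pipeline, swapping NLTK's tokenizer and stop-word list for a naive whitespace split and a tiny hard-coded set:

```python
import re

# Tiny illustrative subset of NLTK's English stop words
STOP_WORDS = {"the", "a", "is", "it", "to", "i"}

def nlp_pipeline(text):
    # Normalize: lower-case and strip punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Tokenize: naive whitespace split (the notebook uses word_tokenize)
    words = text.split()
    # Remove stop words
    return [w for w in words if w not in STOP_WORDS]

print(nlp_pipeline("The movie is GREAT, isn't it?"))  # ['movie', 'great', 'isn', 't']
```

Cleaning (fetching and HTML stripping) happens before this function, and stemming/lemmatization would follow it, matching the diagram above.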