
Commit 87fa1be

master: Adding NLP folder.
1 parent 7362b09 commit 87fa1be


78 files changed, +334851 -0 lines changed

# Week 1: The NLP Pipeline
# Code
Notebook : [NLP Pipeline](https://github.com/purvasingh96/Natural-Language-Specialization/blob/master/Week-1/text_processing.ipynb)
# Summary
## Cleaning
In this step, we perform the following tasks:

1. Get the text (`requests.get(url).text`)
2. Remove HTML tags 🏷 using `BeautifulSoup`.

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page from step 1
r = requests.get(url)
soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())
```
3. Perform web scraping 🕷

```python
# summaries is the list of story elements parsed from soup in the notebook
# Extract the title of the first story
summaries[0].find("a", class_="storylink").get_text().strip()
```

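Putting the parsing and extraction together in a self-contained sketch (the HTML snippet and the `tr.athing` / `a.storylink` selectors here are illustrative stand-ins for whatever page the notebook actually scrapes):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fetched news page
html = '''
<tr class="athing">
  <td><a class="storylink" href="#"> Example story title </a></td>
</tr>
'''
soup = BeautifulSoup(html, "html.parser")

# Collect the story rows, then pull out the first title
summaries = soup.find_all("tr", class_="athing")
title = summaries[0].find("a", class_="storylink").get_text().strip()
print(title)  # Example story title
```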
## Normalization
### Case Normalization
Convert all text to lower case: `text = text.lower()`.

### Punctuation Removal
Remove all punctuation marks.
```python
import re

# Replace everything that is not a letter or digit with a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
```

## Tokenization
41+
42+
### Split the text
Tokenize all the words in a text, or tokenize the text at the sentence level.

```python
# Requires: nltk.download("punkt")
from nltk.tokenize import word_tokenize, sent_tokenize

# Split text into words using NLTK
words = word_tokenize(text)

# Split text into sentences
sentences = sent_tokenize(text)
```

### Remove stop-words
58+
Stop words include words such as *'i', 'me', 'my', 'myself', 'we', 'our', 'ours', etc.*, which increase our vocab size unnecessarily. We remove them as follows:

```python
# Requires: nltk.download("stopwords")
from nltk.corpus import stopwords

# Remove stop words
words = [w for w in words if w not in stopwords.words("english")]
```

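The same filtering pattern with a tiny hard-coded stop-word set (a small subset of NLTK's English list), so it runs without downloading the corpus:

```python
# Tiny illustrative subset of NLTK's English stop words
stop_words = {"i", "me", "my", "the", "a", "is"}

words = ["i", "like", "the", "movie"]
filtered = [w for w in words if w not in stop_words]
print(filtered)  # ['like', 'movie']
```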
## Stemming / Lemmatization

Stemming reduces a word to its *stem*, while lemmatization reduces it to its *root* (lemma). The difference between the two processes is that stemming may not generate a meaningful word, whereas the root word generated by lemmatization is always meaningful.

```python
# Requires: nltk.download("wordnet")
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmed = [PorterStemmer().stem(w) for w in words]
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
```

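To see why stems are not always dictionary words, run `PorterStemmer` on a few inflected forms (assuming NLTK is installed; the stemmer itself needs no corpus download):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["studies", "caring"]:
    print(word, "->", stemmer.stem(word))
# 'studies' -> 'studi' (not a dictionary word), whereas a
# lemmatizer would return the real word 'study'
```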
## Final NLP Pipeline
The final pipeline for text pre-processing looks as follows:

<img src="./images/NLP Pipeline.png" height="300">
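A minimal, dependency-free sketch of the whole pipeline, swapping NLTK's tokenizer and stop-word list for a naive whitespace split and a tiny hard-coded set:

```python
import re

# Tiny illustrative subset of NLTK's English stop words
STOP_WORDS = {"the", "a", "is", "it", "to", "i"}

def nlp_pipeline(text):
    # Normalize: lower-case and strip punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Tokenize: naive whitespace split (the notebook uses word_tokenize)
    words = text.split()
    # Remove stop words
    return [w for w in words if w not in STOP_WORDS]

print(nlp_pipeline("The movie is GREAT, isn't it?"))  # ['movie', 'great', 'isn', 't']
```

Cleaning (fetching and HTML stripping) happens before this function, and stemming/lemmatization would follow it, matching the diagram above.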