Commit 2ab0646

switched to kaggle dataset hosted by David Shinn
1 parent dad0ed9 commit 2ab0646

1 file changed

+32
-44
lines changed

notebooks/Tutorial.ipynb

Lines changed: 32 additions & 44 deletions
@@ -38,33 +38,24 @@
3838
"cell_type": "markdown",
3939
"metadata": {},
4040
"source": [
41-
"Look at filesystem to see files extracted from BigQuery"
41+
"Look at filesystem to see files extracted from BigQuery (or Kaggle: https://www.kaggle.com/davidshinn/github-issues/)"
4242
]
4343
},
4444
{
4545
"cell_type": "code",
46-
"execution_count": 6,
46+
"execution_count": 9,
4747
"metadata": {},
4848
"outputs": [
4949
{
5050
"name": "stdout",
5151
"output_type": "stream",
5252
"text": [
53-
"-rw-r--r-- 1 root root 272M Jan 16 00:41 seq2seqdata000000000000.csv\r\n",
54-
"-rw-r--r-- 1 root root 272M Jan 16 00:41 seq2seqdata000000000001.csv\r\n",
55-
"-rw-r--r-- 1 root root 272M Jan 16 00:41 seq2seqdata000000000002.csv\r\n",
56-
"-rw-r--r-- 1 root root 273M Jan 16 00:41 seq2seqdata000000000003.csv\r\n",
57-
"-rw-r--r-- 1 root root 273M Jan 16 00:41 seq2seqdata000000000004.csv\r\n",
58-
"-rw-r--r-- 1 root root 273M Jan 16 00:41 seq2seqdata000000000005.csv\r\n",
59-
"-rw-r--r-- 1 root root 273M Jan 16 00:41 seq2seqdata000000000006.csv\r\n",
60-
"-rw-r--r-- 1 root root 273M Jan 16 00:41 seq2seqdata000000000007.csv\r\n",
61-
"-rw-r--r-- 1 root root 272M Jan 16 00:41 seq2seqdata000000000008.csv\r\n",
62-
"-rw-r--r-- 1 root root 272M Jan 16 00:41 seq2seqdata000000000009.csv\r\n"
53+
"-rw-r--r-- 1 40294 40294 2.7G Jan 18 2018 github_issues.csv\r\n"
6354
]
6455
}
6556
],
6657
"source": [
67-
"!ls -lah | grep csv"
58+
"!ls -lah | grep github_issues.csv"
6859
]
6960
},
7061
{
@@ -76,7 +67,7 @@
7667
},
7768
{
7869
"cell_type": "code",
79-
"execution_count": 8,
70+
"execution_count": 11,
8071
"metadata": {},
8172
"outputs": [
8273
{
@@ -115,55 +106,52 @@
115106
" </thead>\n",
116107
" <tbody>\n",
117108
" <tr>\n",
118-
" <th>344296</th>\n",
119-
" <td>\"https://github.com/Tendrl/node-agent/issues/617\"</td>\n",
120-
" <td>node_agent should handle sds native alerts also</td>\n",
121-
" <td>some of the sds alerts do not have clearing alerts. so it always present in alerting directory. these kinds of alerts should be stored in etcd under /alerting/notify, it never goes to alerting/alerts directory and it is not displayed under alerts in ui also. these kinds of alerts are notified via notification channel and deleted via ttl. node_agent should have a logic to handle this in alerting framework.</td>\n",
109+
" <th>3165423</th>\n",
110+
" <td>\"https://github.com/1000hz/bootstrap-validator/issues/574\"</td>\n",
111+
" <td>uncaught typeerror: f b is not a function when using $ ... .validator 'update'</td>\n",
112+
" <td>the above error is being thrown when i try and run update via js to include some new fields that have been added dynamically. i'm using backbone.js rendering a script template element to add a new set up fields based on user interaction. the full error message is: uncaught typeerror: f b is not a function at htmlformelement.&lt;anonymous&gt; validator.min.js:9 at function.each jquery.min.js:2 at n.fn.init.each jquery.min.js:2 at n.fn.init.b as validator validator.min.js:9 at n.initskillgroup app.l...</td>\n",
122113
" </tr>\n",
123114
" <tr>\n",
124-
" <th>177469</th>\n",
125-
" <td>\"https://github.com/Eonasdan/bootstrap-datetimepicker/issues/2032\"</td>\n",
126-
" <td>dst problems with some timezones</td>\n",
127-
" <td>hello! i have created a datetimepicker with approximataly following config: $element.datetimepicker { locale: 'ru', timezone: 'europe/moscow', defaultdate: moment 614116800000 , format: 'dd.mm.yyyy' } but it shows the date 17.06.1989 instead of 18.06.1989. where can be the problem and what are the ways to resolve it? the plugin version is 4.17.47</td>\n",
115+
" <th>2763145</th>\n",
116+
" <td>\"https://github.com/quasar-analytics/quasar/issues/2821\"</td>\n",
117+
" <td>invoke endpoint regression</td>\n",
118+
" <td>problem accures in versions: 21.x.x , 23.x.x and 24.x.x didn't check 22.x.x first query is put to view mount sql select from /test-mount/testdb/flatviz the second one sql select row.seriesone as seriesone, row.seriestwo as seriestwo, min row.measureone as measureone from output_of_first_query as row group by row.seriesone, row.seriestwo order by row.seriesone asc, row.seriestwo asc the third one is sql select from output_of_second_query where seriesone = one-one in 20.14.13 this works as exp...</td>\n",
128119
" </tr>\n",
129120
" <tr>\n",
130-
" <th>243616</th>\n",
131-
" <td>\"https://github.com/Simperium/simperium-js/issues/22\"</td>\n",
132-
" <td>two way sync not working as expected.</td>\n",
133-
" <td>i'm having an issue syncing data. can someone tell me if i'm doing it wrong. i posted this in stackoverflow and got no responses. whats happening is in window one. if i update my teams array then do a bucket.update 'team-1',teams ; in console 2 i see the new updated teams object and its put into simperium properly. however in window 2 when i do the exact same thing after receiving the new teams object window 1 doesn't get the update nor does simperium. code is bellow. var bucket = simperium....</td>\n",
121+
" <th>3882729</th>\n",
122+
" <td>\"https://github.com/msharov/ustl/issues/79\"</td>\n",
123+
" <td>build ustl with clang on linux</td>\n",
124+
" <td>hi, on ubuntu 14.04 clang 3.4, gcc 4.8.4 and fedora 22 clang 3.5, gcc 5.3.1 : cc=clang cxx=clang++ ./configure --libdir=path/to/libsupc++.a without --libdir it searches for libcxxabi when cc=clang make works fine, make check however shows quite a few diffs. is such configuration supposed to work? thanks!</td>\n",
134125
" </tr>\n",
135126
" </tbody>\n",
136127
"</table>\n",
137128
"</div>"
138129
],
139130
"text/plain": [
140-
" issue_url \\\n",
141-
"344296 \"https://github.com/Tendrl/node-agent/issues/617\" \n",
142-
"177469 \"https://github.com/Eonasdan/bootstrap-datetimepicker/issues/2032\" \n",
143-
"243616 \"https://github.com/Simperium/simperium-js/issues/22\" \n",
131+
" issue_url \\\n",
132+
"3165423 \"https://github.com/1000hz/bootstrap-validator/issues/574\" \n",
133+
"2763145 \"https://github.com/quasar-analytics/quasar/issues/2821\" \n",
134+
"3882729 \"https://github.com/msharov/ustl/issues/79\" \n",
144135
"\n",
145-
" issue_title \\\n",
146-
"344296 node_agent should handle sds native alerts also \n",
147-
"177469 dst problems with some timezones \n",
148-
"243616 two way sync not working as expected. \n",
136+
" issue_title \\\n",
137+
"3165423 uncaught typeerror: f b is not a function when using $ ... .validator 'update' \n",
138+
"2763145 invoke endpoint regression \n",
139+
"3882729 build ustl with clang on linux \n",
149140
"\n",
150-
" body \n",
151-
"344296 some of the sds alerts do not have clearing alerts. so it always present in alerting directory. these kinds of alerts should be stored in etcd under /alerting/notify, it never goes to alerting/alerts directory and it is not displayed under alerts in ui also. these kinds of alerts are notified via notification channel and deleted via ttl. node_agent should have a logic to handle this in alerting framework. \n",
152-
"177469 hello! i have created a datetimepicker with approximataly following config: $element.datetimepicker { locale: 'ru', timezone: 'europe/moscow', defaultdate: moment 614116800000 , format: 'dd.mm.yyyy' } but it shows the date 17.06.1989 instead of 18.06.1989. where can be the problem and what are the ways to resolve it? the plugin version is 4.17.47 \n",
153-
"243616 i'm having an issue syncing data. can someone tell me if i'm doing it wrong. i posted this in stackoverflow and got no responses. whats happening is in window one. if i update my teams array then do a bucket.update 'team-1',teams ; in console 2 i see the new updated teams object and its put into simperium properly. however in window 2 when i do the exact same thing after receiving the new teams object window 1 doesn't get the update nor does simperium. code is bellow. var bucket = simperium.... "
141+
" body \n",
142+
"3165423 the above error is being thrown when i try and run update via js to include some new fields that have been added dynamically. i'm using backbone.js rendering a script template element to add a new set up fields based on user interaction. the full error message is: uncaught typeerror: f b is not a function at htmlformelement.<anonymous> validator.min.js:9 at function.each jquery.min.js:2 at n.fn.init.each jquery.min.js:2 at n.fn.init.b as validator validator.min.js:9 at n.initskillgroup app.l... \n",
143+
"2763145 problem accures in versions: 21.x.x , 23.x.x and 24.x.x didn't check 22.x.x first query is put to view mount sql select from /test-mount/testdb/flatviz the second one sql select row.seriesone as seriesone, row.seriestwo as seriestwo, min row.measureone as measureone from output_of_first_query as row group by row.seriesone, row.seriestwo order by row.seriesone asc, row.seriestwo asc the third one is sql select from output_of_second_query where seriesone = one-one in 20.14.13 this works as exp... \n",
144+
"3882729 hi, on ubuntu 14.04 clang 3.4, gcc 4.8.4 and fedora 22 clang 3.5, gcc 5.3.1 : cc=clang cxx=clang++ ./configure --libdir=path/to/libsupc++.a without --libdir it searches for libcxxabi when cc=clang make works fine, make check however shows quite a few diffs. is such configuration supposed to work? thanks! "
154145
]
155146
},
156-
"execution_count": 8,
147+
"execution_count": 11,
157148
"metadata": {},
158149
"output_type": "execute_result"
159150
}
160151
],
161152
"source": [
162153
"#read in data sample 2M rows (for speed of tutorial)\n",
163-
"traindf, testdf = train_test_split(\n",
164-
" pd.concat([\n",
165-
" pd.read_csv(f) for f in glob.glob('*.csv')\n",
166-
" ]).sample(n=2000000), \n",
154+
"traindf, testdf = train_test_split(pd.read_csv('github_issues.csv').sample(n=2000000), \n",
167155
" test_size=.10)\n",
168156
"\n",
169157
"\n",
@@ -1544,7 +1532,7 @@
15441532
"outputs": [],
15451533
"source": [
15461534
"# Read All 5M data points\n",
1547-
"all_data_df = pd.concat([pd.read_csv(f) for f in glob.glob('*.csv')])\n",
1535+
"all_data_df = pd.read_csv('github_issues.csv')\n",
15481536
"# Extract the bodies from this dataframe\n",
15491537
"all_data_bodies = all_data_df['body'].tolist()"
15501538
]
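
Note on the changed cells: both now read the single `github_issues.csv` file instead of concatenating ten BigQuery shards, and the tutorial cell still samples rows before a 90/10 split. A minimal, dependency-free sketch of that sample-then-split step follows, using only the standard library rather than the notebook's `pd.read_csv` and `train_test_split`. The function name `sample_and_split`, the `seed` parameter, the rounding of the test-set size, and the in-memory demo data are assumptions for illustration; the column names are taken from the output shown in the diff.

```python
import csv
import io
import random


def sample_and_split(csv_file, n, test_size=0.10, seed=0):
    """Read rows from an open CSV file, sample up to n of them,
    and split the sample into (train_rows, test_rows).

    Hypothetical helper: it mirrors the idea of
    `read_csv(...).sample(n=...)` followed by a 90/10 split,
    not the notebook's exact behaviour.
    """
    rows = list(csv.DictReader(csv_file))
    rng = random.Random(seed)
    # Guard against asking for more rows than the file contains.
    sampled = rng.sample(rows, min(n, len(rows)))
    # Assumption: the test-set size is rounded to the nearest row.
    n_test = int(round(len(sampled) * test_size))
    return sampled[n_test:], sampled[:n_test]


# Tiny in-memory stand-in for github_issues.csv (made-up rows).
demo = io.StringIO(
    "issue_url,issue_title,body\n"
    + "".join(f"url{i},title{i},body{i}\n" for i in range(20))
)
train_rows, test_rows = sample_and_split(demo, n=10)
print(len(train_rows), len(test_rows))
```

Capping the sample at the number of available rows avoids an error when the file holds fewer rows than requested. The notebook's own call does not show such a guard, so it may fail on a smaller copy of the dataset.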
