Automatic extraction of data records from web pages based on DOM tree structure and visual information.
- OS: Windows with JDK 8
- Software: Mozilla Firefox (32-bit)
- Use the latest version of IntelliJ IDEA to import the project.
- Run the Main class `src/Main.java`; a Swing GUI will pop up, as shown in the screenshot below:
  - Paste the URL of the web page to extract data records from into the `Source URL` text input.
  - The `Expected records count` input can be left blank, as it won't affect the result.
  - Select at least one of the two checkboxes for the extraction strategy: `area` and `node distance`.
  - Choose one output format from `json`, `xml`, or `json & xml`.
  - Finally, click the "parse" button to execute the extraction. The extracted records will appear in the text area at the bottom, as shown below:
- To test on different web pages automatically, refer to `test/TestHTMLDataExtractor.java`, where you can modify and run the method `testWebpages()`.
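
As a rough illustration of what batch testing over several pages can look like, here is a minimal Java 8 sketch. It is not the project's actual test code: the `RecordExtractor` interface, the `countRecords` helper, and the example URLs are all hypothetical stand-ins, and the real `testWebpages()` method may be organized quite differently. A stub extractor is used so the sketch runs without network access or Firefox.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BatchExtractionSketch {

    // Hypothetical stand-in for the project's extractor; the real class and
    // method names in this repository may differ.
    interface RecordExtractor {
        List<String> extract(String url) throws Exception;
    }

    // Runs the extractor on every URL and maps each URL to the number of
    // records found. A failed extraction is recorded as -1 so that one bad
    // page does not stop the whole batch.
    static Map<String, Integer> countRecords(List<String> urls, RecordExtractor extractor) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String url : urls) {
            try {
                counts.put(url, extractor.extract(url).size());
            } catch (Exception e) {
                counts.put(url, -1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stub extractor (assumption): always returns two placeholder records.
        RecordExtractor stub = url -> Arrays.asList("record 1", "record 2");

        // Placeholder URLs; replace with the pages you want to test.
        List<String> urls = Arrays.asList("https://example.com/a", "https://example.com/b");

        for (Map.Entry<String, Integer> entry : countRecords(urls, stub).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue() + " records");
        }
    }
}
```

To adapt the idea, the stub would be swapped for a call into the project's own extraction code, and the per-URL counts could be compared against the record counts you expect for each page.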