HTML Data Extractor

Automatic extraction of data records from web pages based on DOM tree structure and visual information.

Environment

Use the latest version of IntelliJ IDEA to import the project.
Run the Main class, src/Main.java. A Swing GUI will pop up, as shown in the screenshot below:
1. Paste the URL of the web page you want to extract data records from into the Source URL text input.
2. The Expected records count input can be left blank as it won't affect the result.
3. Select at least one of the two checkboxes for the extraction strategy: area and node distance.
4. Choose one output format from json, xml, or json & xml.
5. Finally, click the "parse" button to execute the extraction. The extracted records will appear in the text area at the bottom, as shown below:
  - Google Scholar:
  - Best Buy:
To test on different web pages automatically, please refer to test/TestHTMLDataExtractor.java, where you can modify and run the method testWebpages().
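
As a rough illustration of batch testing over several pages, the sketch below runs an extractor over a list of URLs and collects one result per URL. It is not the code in test/TestHTMLDataExtractor.java: the class name `BatchExtractionSketch`, the method `runAll`, the placeholder URLs, and the stand-in extractor lambda are all hypothetical, since this README does not show the project's actual entry point. To use it, replace the lambda with a call into the project's extraction code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class BatchExtractionSketch {

    /**
     * Applies the given extractor to every URL and returns the outputs keyed by URL,
     * in the order the URLs were given. A runtime failure on one page is recorded as
     * an "ERROR: ..." entry so that the remaining pages are still processed.
     */
    public static Map<String, String> runAll(List<String> urls,
                                             Function<String, String> extractor) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String url : urls) {
            try {
                results.put(url, extractor.apply(url));
            } catch (RuntimeException e) {
                results.put(url, "ERROR: " + e.getMessage());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Placeholder URLs (assumption); replace with the pages you want to test.
        List<String> urls = Arrays.asList(
                "https://example.com/page-one",
                "https://example.com/page-two");

        // Stand-in extractor (assumption): swap in the project's real extraction call.
        Function<String, String> extractor = url -> "{\"source\": \"" + url + "\"}";

        runAll(urls, extractor)
                .forEach((url, output) -> System.out.println(url + " -> " + output));
    }
}
```

Recording failures per URL instead of aborting keeps one unreachable or malformed page from hiding the results for the others.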
