naco-siren/html-data-extractor

HTML Data Extractor

Automatic extraction of data records from web pages based on DOM tree structure and visual information.
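
This README does not describe the extraction algorithm itself, so the following is only a rough sketch of one common structure-based idea: data records often show up as sibling subtrees that share the same tag structure. The `Node` class, the `signature` format, and `countChildSignatures` are hypothetical names invented for illustration; they are not taken from this project, and the sketch ignores visual information entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecordSketch {
    /** Hypothetical minimal DOM node: a tag name plus child nodes. */
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();

        Node(String tag, Node... kids) {
            this.tag = tag;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Structural signature: the tag followed by its child signatures, e.g. "li(a,span)". */
    static String signature(Node node) {
        if (node.children.isEmpty()) {
            return node.tag;
        }
        StringBuilder sb = new StringBuilder(node.tag).append('(');
        for (int i = 0; i < node.children.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(signature(node.children.get(i)));
        }
        return sb.append(')').toString();
    }

    /** Counts how often each signature occurs among the direct children of parent. */
    static Map<String, Integer> countChildSignatures(Node parent) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Node child : parent.children) {
            counts.merge(signature(child), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Node list = new Node("ul",
                new Node("li", new Node("a"), new Node("span")),
                new Node("li", new Node("a"), new Node("span")),
                new Node("li", new Node("a"), new Node("span")),
                new Node("li", new Node("hr")));
        // Signatures that repeat among siblings are candidate data records.
        countChildSignatures(list).forEach((sig, n) ->
                System.out.println(sig + " x" + n));
    }
}
```

In this toy tree the three `li(a,span)` items share one signature and would be treated as candidate records, while the lone `li(hr)` item would not. Exact signature matching is the simplest possible choice; a real extractor would likely need a tolerant similarity measure.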

Environment

  • OS: Windows with JDK 8
  • Software: Mozilla Firefox (32-bit)

Usage and Demo

  1. Use the latest version of IntelliJ IDEA to import the project.

  2. Run the Main class (src/Main.java). A Swing GUI will pop up, as shown in the screenshot below:

    GUI

    1. Paste the URL of the web page you want to extract data records from into the Source URL text input.
    2. The Expected records count input can be left blank, as it won't affect the result.
    3. Select at least one of the two extraction strategy checkboxes: area and node distance.
    4. Choose one output format: json, xml, or json & xml.
    5. Finally, click the "parse" button to run the extraction. The extracted records appear in the text area at the bottom, as shown below:
      • Google Scholar: Google Scholar output example
      • Best Buy: Best Buy output example
  3. To test on different web pages automatically, please refer to test/TestHTMLDataExtractor.java, where you can modify and run the method testWebpages().
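
To give a feel for the json and xml output options mentioned in the steps above, here is a minimal sketch that serializes the same records both ways. The actual output schema of this tool is not documented here, so the record layout (a list of text fields), the element names `records`, `record`, and `field`, and the class `OutputSketch` are all assumptions made for illustration. It sticks to the standard library and JDK 8 features.

```java
import java.util.Arrays;
import java.util.List;

public class OutputSketch {
    /** Escapes characters that must not appear raw inside a JSON string. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the 4-digit hex form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Escapes the characters that are special in XML element content. */
    static String xmlEscape(String s) {
        // '&' must be replaced first so later entities are not double-escaped.
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Assumed JSON layout: an array of records, each an array of field strings. */
    static String toJson(List<List<String>> records) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < records.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('[');
            List<String> fields = records.get(i);
            for (int j = 0; j < fields.size(); j++) {
                if (j > 0) {
                    sb.append(',');
                }
                sb.append('"').append(jsonEscape(fields.get(j))).append('"');
            }
            sb.append(']');
        }
        return sb.append(']').toString();
    }

    /** Assumed XML layout: <records><record><field>...</field></record></records>. */
    static String toXml(List<List<String>> records) {
        StringBuilder sb = new StringBuilder("<records>");
        for (List<String> fields : records) {
            sb.append("<record>");
            for (String field : fields) {
                sb.append("<field>").append(xmlEscape(field)).append("</field>");
            }
            sb.append("</record>");
        }
        return sb.append("</records>").toString();
    }

    public static void main(String[] args) {
        List<List<String>> records = Arrays.asList(
                Arrays.asList("Example title", "Author A, Author B"),
                Arrays.asList("R&D <overview>", "Author C"));
        System.out.println(toJson(records));
        System.out.println(toXml(records));
    }
}
```

Hand-rolled escaping keeps the sketch dependency-free; in a real project a JSON or XML library would usually be the safer choice.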
