PDF Table Scraper

A program that automatically extracts tables from multiple PDFs and saves the tables from each scraped PDF into a separate CSV file.

Required modules

  • tabula-py
pip3 install tabula-py
  • Java: make sure it is installed, because tabula-py is a wrapper around a Java library (tabula-java) that does the actual scraping
sudo apt install default-jre
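
If you want to confirm both requirements before running the scraper, a quick check along these lines may help. This is only a sketch and not part of scraper.py: the function name `missing_prerequisites` is made up for illustration, and it assumes tabula-py is importable under the module name `tabula` and that the Java runtime is exposed as a `java` executable on your PATH.

```python
import importlib.util
import shutil


def missing_prerequisites():
    """Return a list of problems; an empty list means both requirements were found."""
    problems = []
    # tabula-py is imported under the module name "tabula".
    if importlib.util.find_spec("tabula") is None:
        problems.append("tabula-py is not installed (pip3 install tabula-py)")
    # tabula-py calls out to Java, so a "java" executable must be on PATH.
    if shutil.which("java") is None:
        problems.append("java not found on PATH (sudo apt install default-jre)")
    return problems
```

Calling `missing_prerequisites()` and printing each returned entry shows what still needs to be installed.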

Usage

  1. First-time use (run once): Running the script for the first time creates the PDFs folder, where you put the PDFs you want to scrape. The extracted tables will go into a second folder.
python3 scraper.py
  2. Copy the PDFs you want to scrape into the PDFs folder.

  3. Re-run the script and wait for it to finish. A tables folder will be created containing the scraped tables.

python3 scraper.py
  4. A short summary printed in the terminal lists which PDFs were scraped successfully and which failed.
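
The steps above can be sketched roughly as follows. This is not the actual scraper.py, only a minimal illustration of the same workflow. The names `extract_with_tabula`, `scrape_folder`, and `format_summary` are hypothetical, and the sketch assumes that one CSV is written per PDF using tabula-py's `convert_into` with `pages="all"`. The real script may extract and save tables differently.

```python
from pathlib import Path


def extract_with_tabula(pdf_path, csv_path):
    """Write all tables found in one PDF into one CSV file (assumed approach)."""
    import tabula  # imported lazily so the rest of the sketch loads without it
    tabula.convert_into(str(pdf_path), str(csv_path), output_format="csv", pages="all")


def scrape_folder(pdf_dir="PDFs", out_dir="tables", extract=extract_with_tabula):
    """Scrape every *.pdf in pdf_dir; return (succeeded, failed) file names."""
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    if not pdf_dir.exists():
        # First run: create the input folder and stop, there is nothing to scrape yet.
        pdf_dir.mkdir(parents=True)
        return [], []
    out_dir.mkdir(parents=True, exist_ok=True)
    succeeded, failed = [], []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        try:
            extract(pdf, out_dir / (pdf.stem + ".csv"))
            succeeded.append(pdf.name)
        except Exception:
            # Record the failure and keep going with the remaining PDFs.
            failed.append(pdf.name)
    return succeeded, failed


def format_summary(succeeded, failed):
    """Build the short end-of-run summary text."""
    lines = [f"Scraped {len(succeeded)} PDF(s), {len(failed)} failed."]
    lines += [f"  failed: {name}" for name in failed]
    return "\n".join(lines)
```

A run would then look like `ok, bad = scrape_folder()` followed by `print(format_summary(ok, bad))`. Passing the extractor as a parameter is a design choice made here so the folder handling can be tried out without Java installed.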

Demonstration Video

PDFTableScraper.mp4