This is a Streamlit application that lets users upload PDF files, extract their text, and view the result in a clean, readable format.
- Upload PDF Files: Easily upload PDF documents for text extraction.
- Text Extraction: Extracts and processes text from PDFs using PyMuPDF (Fitz).
- Interactive Interface: User-friendly interface to upload files and view results instantly.
- Python 3.9 or higher
- Libraries:
  - streamlit
  - pymupdf
  - unidecode
- Clone this repository or download the code:
  git clone https://github.com/edgelearningcentre/pdf2text_parser.git
  cd pdf2text_parser
- Install the required Python packages:
  pip install -r requirements.txt
- Run the Streamlit app:
  streamlit run app.py
- Open the Streamlit app in your browser (usually at http://localhost:8501).
- Upload a PDF file using the file uploader.
- Click the "Extract Text" button.
- View the extracted and formatted text displayed in the app.
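
The extraction step behind the "Extract Text" button could be sketched roughly as follows. This is a minimal illustration and not necessarily the code in app.py: the function names `extract_text` and `normalize_whitespace` and the whitespace rules are assumptions made for this example. Only the use of PyMuPDF and unidecode comes from the requirements above.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Tidy extracted text (hypothetical cleanup rules, chosen for illustration)."""
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # drop spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


def extract_text(pdf_bytes: bytes) -> str:
    """Return the text of every page of a text-based PDF, cleaned up."""
    # Imported here so normalize_whitespace can be used without these packages.
    import fitz  # PyMuPDF
    from unidecode import unidecode

    # Open the PDF from memory, as it would arrive from an upload widget.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        pages = [page.get_text() for page in doc]

    # Transliterate non-ASCII characters, then tidy the whitespace.
    return normalize_whitespace(unidecode("\n\n".join(pages)))
```

In a Streamlit script, the bytes would typically come from the object returned by `st.file_uploader`, for example `extract_text(uploaded_file.getvalue())`, with the result shown through a widget such as `st.text_area`.
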
- Uploaded PDF: A research paper or document.
- Extracted Text: Structured content
- The accuracy of heading detection relies on simple heuristics and may require adjustments for complex PDF layouts.
- Currently supports only text-based PDFs, not scanned image PDFs.
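
To make the two limitations above concrete, here is a small sketch of what a font-size heading heuristic and a scanned-PDF check might look like. Both are assumptions for illustration, not the app's actual rules: the names `detect_headings` and `looks_scanned`, the 1.2 size ratio, the 12-word limit, and the 20-character threshold are all hypothetical. The `(text, font_size)` pairs are assumed to have been collected beforehand, for instance from the spans that PyMuPDF returns with `page.get_text("dict")`.

```python
from collections import Counter


def detect_headings(spans, ratio=1.2, max_words=12):
    """Return the span texts that look like headings.

    spans: list of (text, font_size) pairs in reading order.
    A span counts as a heading when its font is noticeably larger than
    the body font and the text is short (hypothetical thresholds).
    """
    # Treat the font size that covers the most characters as the body size.
    chars_per_size = Counter()
    for text, size in spans:
        chars_per_size[round(size, 1)] += len(text)
    if not chars_per_size:
        return []
    body_size = chars_per_size.most_common(1)[0][0]

    headings = []
    for text, size in spans:
        stripped = text.strip()
        is_large = size >= body_size * ratio
        is_short = len(stripped.split()) <= max_words
        if stripped and is_large and is_short:
            headings.append(stripped)
    return headings


def looks_scanned(page_texts, min_chars=20):
    """Return True when no page yields a meaningful amount of text."""
    return all(len(text.strip()) < min_chars for text in page_texts)


spans = [
    ("Introduction", 16.0),
    ("This paper studies text extraction from PDF files.", 10.0),
    ("Methods", 16.0),
    ("We parse each page and collect its spans.", 10.0),
]
headings = detect_headings(spans)
```

A heuristic like this breaks down on layouts where headings share the body font size or where large text appears in captions and pull quotes, which is why the thresholds may need adjusting per document. A `looks_scanned` result of True suggests the PDF has no text layer and would need OCR, which this app does not provide.
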
Contributions are welcome! Feel free to submit issues or pull requests to improve this app.
This project is licensed under the MIT License. See the LICENSE file for more details.
Author: edgelearningcentre

Contact: [email protected]