vineeth729/URL-scraper

Multi-Stage Web Scraper with Node.js, Puppeteer & Flask

This project demonstrates a multi-stage Docker build that:

  • 🕷️ Scrapes data from a user-provided URL using Node.js, Puppeteer, and Chromium
  • 🐍 Serves the scraped data as JSON via a lightweight Flask web server
  • 🐳 Uses a Docker multi-stage build so the final image ships without Node.js, Puppeteer, or Chromium

🛠 Technologies Used

  • Node.js 18 (slim)
  • Puppeteer (headless browser automation)
  • Chromium
  • Python 3.10 (slim)
  • Flask (Python web server)
  • Docker (Multi-stage build)

📁 Project Structure

project/
├── Dockerfile             # Multi-stage Dockerfile (Node.js + Python)
├── scrape.js              # Puppeteer script to scrape title and heading
├── server.py              # Flask server to serve scraped JSON
└── requirements.txt       # Flask dependency

⚙️ How It Works

1️⃣ Scraper Stage (Node.js + Puppeteer)

  • Installs Chromium and Puppeteer
  • Accepts a SCRAPE_URL as a build argument
  • Uses Puppeteer to scrape:
    • <title> of the page
    • First <h1> heading
  • Outputs scraped_data.json
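
The extraction step above can be illustrated in Python. This is not the project's scrape.js, which uses Puppeteer and Chromium; it is a stand-in that runs the standard library's html.parser over HTML that has already been fetched, to show how the <title> and the first <h1> could map to the "title" and "heading" keys of scraped_data.json. The class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class TitleAndHeadingParser(HTMLParser):
    """Collects the text of <title> and of the first <h1> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._current = None                   # tag whose text is being collected
        self._parts = {"title": [], "h1": []}  # text fragments per tracked tag
        self._done = set()                     # tags that were already captured

    def handle_starttag(self, tag, attrs):
        # Only start collecting the first occurrence of each tracked tag.
        if tag in self._parts and tag not in self._done:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._done.add(tag)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._parts[self._current].append(data)

    def result(self):
        return {
            "title": "".join(self._parts["title"]).strip(),
            "heading": "".join(self._parts["h1"]).strip(),
        }


def scrape_html(html):
    """Return {"title": ..., "heading": ...} for an HTML string."""
    parser = TitleAndHeadingParser()
    parser.feed(html)
    return parser.result()


def write_scraped_data(html, path="scraped_data.json"):
    """Write the extracted fields to the file the final stage copies."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(scrape_html(html), f, indent=2)
```

Unlike Puppeteer, this stand-in does not execute JavaScript, so it only sees headings present in the raw HTML.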

2️⃣ Final Stage (Python + Flask)

  • Copies only scraped_data.json into a minimal Python image
  • Runs a Flask server that serves the JSON on /
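
A minimal sketch of what a server like server.py might look like, assuming scraped_data.json sits in the working directory. The actual file in this repository may differ; the function names load_scraped_data and create_app are hypothetical.

```python
import json
from pathlib import Path


def load_scraped_data(path="scraped_data.json"):
    """Read the JSON file produced by the scraper stage."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def create_app(data_path="scraped_data.json"):
    """Build a Flask app that serves the scraped JSON on / (assumed layout)."""
    # Imported here so load_scraped_data stays usable without Flask installed.
    from flask import Flask, jsonify

    app = Flask(__name__)
    data = load_scraped_data(data_path)  # read once; the file never changes at runtime

    @app.route("/")
    def index():
        return jsonify(data)

    return app
```

Inside a container the app would typically be started with create_app().run(host="0.0.0.0", port=5000). Binding to 0.0.0.0 rather than the default 127.0.0.1 is what lets a published port such as -p 5000:5000 reach the server.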

🚀 Getting Started

✅ Prerequisites

  • Docker installed and running

🧱 Build the Docker Image

docker build --build-arg SCRAPE_URL=https://example.com -t scraper-server .

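Replace https://example.com with the site you want to scrape. Because SCRAPE_URL is a build argument, the page is scraped while the image is built, so scraping a different URL means rebuilding the image.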

🏃 Run the Container

docker run -p 5000:5000 scraper-server

Then open your browser and navigate to:

http://localhost:5000

✅ Example Output

{
  "title": "Example Domain",
  "heading": "Example Domain"
}
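
To check the output from a script instead of a browser, something like the following could be used. It is a sketch that assumes the container is running on localhost:5000 and that the response always carries the "title" and "heading" keys shown above; the function names are hypothetical.

```python
import json
from urllib.request import urlopen

EXPECTED_KEYS = {"title", "heading"}  # keys taken from the example output


def parse_scraped(body):
    """Decode a response body and check that the expected keys are present."""
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = EXPECTED_KEYS.difference(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def fetch_scraped(url="http://localhost:5000/"):
    """Fetch and validate the JSON from a running container."""
    with urlopen(url, timeout=5) as response:
        return parse_scraped(response.read())
```

Calling fetch_scraped() while the container is up should return the same dictionary the browser displays.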

About

A small personal project that scrapes the title and first heading from a given URL and serves the result as JSON.
