sonic0002/defuddle-php

defuddle-php

A PHP 7.1+ port of defuddle that extracts the main content and structured metadata from any HTML page.

Defuddle removes navigation, ads, sidebars, comments, and other clutter, leaving just the article content. It also extracts structured metadata like title, author, published date, and Open Graph/Schema.org data.

Requirements

  • PHP 7.1+
  • Extensions: ext-dom, ext-json, ext-mbstring
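
A quick way to confirm an environment meets the requirements above is sketched below. It uses only PHP built-ins; the helper name `missingExtensions` is illustrative and is not part of this package.

```php
<?php
// Return the names from $required that are not loaded in this PHP build.
// (Hypothetical helper, not part of defuddle-php.)
function missingExtensions(array $required): array
{
    $missing = [];
    foreach ($required as $name) {
        if (!extension_loaded($name)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

if (PHP_VERSION_ID < 70100) {
    echo "PHP 7.1 or newer is required\n";
}

$missing = missingExtensions(['dom', 'json', 'mbstring']);
echo $missing === []
    ? "All required extensions are loaded\n"
    : 'Missing extensions: ' . implode(', ', $missing) . "\n";
```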

Installation

composer require defuddle/defuddle-php

Usage

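require __DIR__ . '/vendor/autoload.php'; // Composer autoloader

use Defuddle\Defuddle;

// Fetching over HTTP with file_get_contents() requires allow_url_fopen.
$url = 'https://example.com/article';
$html = file_get_contents($url);
$result = (new Defuddle($html, $url))->parse();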

echo $result->title;       // "Article Title"
echo $result->author;      // "Jane Smith"
echo $result->published;   // "2025-01-15T10:00:00+00:00"
echo $result->content;     // cleaned HTML of the main article body
echo $result->wordCount;   // 842

With options

use Defuddle\Defuddle;
use Defuddle\DefuddleOptions;

$options = new DefuddleOptions();
$options->debug = true;

$result = (new Defuddle($html, $url))->parse($options);

Output fields

| Field | Type | Description |
| --- | --- | --- |
| content | string | Cleaned HTML of the main article content |
| title | string | Article title (site name stripped) |
| description | string | Article description or excerpt |
| author | string | Author name(s) |
| published | string | Publication date (ISO 8601) |
| site | string | Site or publication name |
| domain | string | Domain name (www. stripped) |
| favicon | string | URL to site favicon |
| image | string | URL to article lead image |
| language | string | Content language (e.g. en, en-US) |
| wordCount | int | Approximate word count of extracted content |
| parseTime | float | Parse time in milliseconds |
| schemaOrgData | array\|null | Parsed Schema.org JSON-LD data |
| metaTags | array | All <meta> tags as [name, property, content] |
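
As an illustration of working with the metaTags field, the sketch below looks up a tag's content by its name or property. It assumes each entry is an associative array with name, property, and content keys, which is one reading of the description above and is not confirmed here; the helper `findMetaContent` and the sample data are hypothetical.

```php
<?php
// Assumption: each metaTags entry is an associative array with "name",
// "property", and "content" keys, and absent attributes are null.
// (Hypothetical helper, not part of defuddle-php.)
function findMetaContent(array $metaTags, string $key): ?string
{
    foreach ($metaTags as $tag) {
        $name = $tag['name'] ?? null;
        $property = $tag['property'] ?? null;
        if ($name === $key || $property === $key) {
            return $tag['content'] ?? null;
        }
    }
    return null;
}

// Stand-in data shaped like the assumed metaTags layout.
$metaTags = [
    ['name' => 'description', 'property' => null, 'content' => 'A short summary'],
    ['name' => null, 'property' => 'og:site_name', 'content' => 'Example News'],
];

echo findMetaContent($metaTags, 'og:site_name'), "\n"; // prints "Example News"
```

If the real entries turn out to be positional lists rather than keyed arrays, the lookups inside the loop would need to use indexes 0, 1, and 2 instead.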

What gets extracted

  • Title: From Open Graph, Twitter Card, Schema.org, or <title>, with site name removed
  • Author: From meta tags, Schema.org, .author/.byline elements, or "By X" patterns
  • Published date: From article:published_time, Schema.org datePublished, <time>, or natural language
  • Image: From og:image, twitter:image, Schema.org, or first large <img>
  • Language: From <html lang> or og:locale
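
The "first source that yields a value wins" pattern in the list above can be sketched with ext-dom alone. This is a simplified illustration for the title field, not the library's actual implementation: the function name `firstTitleCandidate` is hypothetical, the Schema.org lookup is omitted, and the site name is not stripped.

```php
<?php
// Return the first non-empty title candidate, trying Open Graph, then
// Twitter Card, then the <title> element. (Hypothetical helper.)
function firstTitleCandidate(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them;
    // real-world HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $queries = [
        'string(//meta[@property="og:title"]/@content)',
        'string(//meta[@name="twitter:title"]/@content)',
        'string(//title)',
    ];
    foreach ($queries as $query) {
        // string() yields '' when nothing matches, so empty means "try next".
        $value = trim($xpath->evaluate($query));
        if ($value !== '') {
            return $value;
        }
    }
    return null;
}

$sample = '<html><head><title>Article Title | Example News</title>'
    . '<meta property="og:title" content="Article Title"></head>'
    . '<body><p>Text</p></body></html>';

echo firstTitleCandidate($sample), "\n"; // prints "Article Title"
```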

What gets removed

  • Scripts, styles, navigation, headers, footers
  • Ads, sidebars, comments, newsletter signups
  • Hidden elements, small images
  • Author bylines, breadcrumbs, read-time indicators, trailing boilerplate
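
To give a rough idea of what tag-based removal looks like, here is a minimal sketch using DOMDocument and DOMXPath. It only drops a fixed set of elements by tag name; detecting ads, hidden elements, bylines, and boilerplate needs heuristics that are not shown, and the function name `stripClutter` is hypothetical rather than part of this package.

```php
<?php
// Remove a fixed set of clutter elements and return the remaining body HTML.
// (Hypothetical helper; far simpler than a real content extractor.)
function stripClutter(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $clutter = $xpath->query('//script | //style | //nav | //header | //footer | //aside');
    foreach ($clutter as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize whatever is left inside <body>.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return trim($out);
}

$page = '<html><body><header>Site</header><nav>Menu</nav>'
    . '<article><p>Hello</p></article>'
    . '<footer>Foot</footer><script>track()</script></body></html>';

echo stripClutter($page), "\n";
```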

License

MIT — see LICENSE

Based on defuddle by @kepano (MIT).

About

PHP version of defuddle (https://github.com/kepano/defuddle)
