PHP 7.1+ port of defuddle — extract main content and structured metadata from any HTML page.
Defuddle removes navigation, ads, sidebars, comments, and other clutter, leaving just the article content. It also extracts structured metadata like title, author, published date, and Open Graph/Schema.org data.
Requirements:
- PHP 7.1+
- Extensions: `ext-dom`, `ext-json`, `ext-mbstring`
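
The requirements above can be checked on a target machine before installing. This is a minimal sketch using only PHP built-ins; the function name `missingDefuddleRequirements` is hypothetical and not part of the library.

```php
<?php
// Sketch: report which of the documented requirements are not met.
// The function name is illustrative only; it is not part of Defuddle.
function missingDefuddleRequirements(): array
{
    $missing = [];

    // The README asks for PHP 7.1 or newer.
    if (version_compare(PHP_VERSION, '7.1.0', '<')) {
        $missing[] = 'PHP >= 7.1 (running ' . PHP_VERSION . ')';
    }

    // extension_loaded() takes the name without the "ext-" prefix.
    foreach (['dom', 'json', 'mbstring'] as $extension) {
        if (!extension_loaded($extension)) {
            $missing[] = 'ext-' . $extension;
        }
    }

    return $missing;
}

$missing = missingDefuddleRequirements();
echo $missing === []
    ? "All requirements met\n"
    : 'Missing: ' . implode(', ', $missing) . "\n";
```

Composer performs an equivalent check at install time when a package declares these requirements, so a script like this is mainly useful for diagnosing a deployment host.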
Install with Composer:

    composer require defuddle/defuddle-php

Basic usage:

    use Defuddle\Defuddle;

    $html = file_get_contents('https://example.com/article');
    $result = (new Defuddle($html, 'https://example.com/article'))->parse();

    echo $result->title;     // "Article Title"
    echo $result->author;    // "Jane Smith"
    echo $result->published; // "2025-01-15T10:00:00+00:00"
    echo $result->content;   // cleaned HTML of the main article body
    echo $result->wordCount; // 842

Options are passed to `parse()` through a `DefuddleOptions` instance:

    use Defuddle\Defuddle;
    use Defuddle\DefuddleOptions;

    $options = new DefuddleOptions();
    $options->debug = true;

    // $html and $url hold the page source and its address, as in the basic example
    $result = (new Defuddle($html, $url))->parse($options);

The result exposes the following fields:

| Field | Type | Description |
|---|---|---|
| `content` | `string` | Cleaned HTML of the main article content |
| `title` | `string` | Article title (site name stripped) |
| `description` | `string` | Article description or excerpt |
| `author` | `string` | Author name(s) |
| `published` | `string` | Publication date (ISO 8601) |
| `site` | `string` | Site or publication name |
| `domain` | `string` | Domain name (`www.` stripped) |
| `favicon` | `string` | URL to site favicon |
| `image` | `string` | URL to article lead image |
| `language` | `string` | Content language (e.g. `en`, `en-US`) |
| `wordCount` | `int` | Approximate word count of extracted content |
| `parseTime` | `float` | Parse time in milliseconds |
| `schemaOrgData` | `array\|null` | Parsed Schema.org JSON-LD data |
| `metaTags` | `array` | All `<meta>` tags as `[name, property, content]` |
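
As an illustration of consuming these fields, the sketch below flattens a result into an array suitable for storage. It uses a `stdClass` stand-in instead of a real parse result so that it runs without the library installed; the helper `resultToRecord` and the output keys are hypothetical.

```php
<?php
// Stand-in for the object returned by parse(); values mirror the usage example.
$result = new stdClass();
$result->title = 'Article Title';
$result->author = 'Jane Smith';
$result->published = '2025-01-15T10:00:00+00:00';
$result->domain = 'example.com';
$result->wordCount = 842;
$result->schemaOrgData = null; // documented as array|null

// Hypothetical helper: pick a few documented fields and normalise them.
function resultToRecord($result): array
{
    $published = $result->published ?? '';

    return [
        'title' => $result->title ?? '',
        'author' => $result->author ?? '',
        // published is documented as ISO 8601, which DateTimeImmutable parses;
        // an unparseable value would throw, so real code may want a try/catch.
        'published_day' => $published !== ''
            ? (new DateTimeImmutable($published))->format('Y-m-d')
            : null,
        'domain' => $result->domain ?? '',
        'word_count' => (int) ($result->wordCount ?? 0),
        'has_schema_org' => is_array($result->schemaOrgData ?? null),
    ];
}

$record = resultToRecord($result);
echo json_encode($record, JSON_PRETTY_PRINT), "\n";
```

The `??` fallbacks keep the helper tolerant of fields that are absent on a given page, which is an assumption about how one might handle sparse metadata rather than documented library behaviour.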

Metadata is extracted from the following sources:
- Title: From Open Graph, Twitter Card, Schema.org, or `<title>`, with site name removed
- Author: From meta tags, Schema.org, `.author`/`.byline` elements, or "By X" patterns
- Published date: From `article:published_time`, Schema.org `datePublished`, `<time>`, or natural language
- Image: From `og:image`, `twitter:image`, Schema.org, or first large `<img>`
- Language: From `<html lang>` or `og:locale`
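
The source lists above amount to a fallback chain: try the most specific source first, then fall back to the next. The following is a rough sketch of that idea using `DOMDocument` and `DOMXPath`. It is not the library's implementation; the helper `firstMatch`, the XPath expressions, and their ordering are assumptions for illustration, and site-name stripping is omitted.

```php
<?php
// Hypothetical helper: return the first non-empty value among several XPath queries.
function firstMatch(DOMXPath $xpath, array $queries): ?string
{
    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            $value = trim($nodes->item(0)->nodeValue);
            if ($value !== '') {
                return $value;
            }
        }
    }
    return null;
}

$html = '<html lang="en-US"><head>'
      . '<title>Article Title | Example Site</title>'
      . '<meta property="og:title" content="Article Title">'
      . '<meta property="article:published_time" content="2025-01-15T10:00:00+00:00">'
      . '</head><body><p>Hello</p></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; silence warnings
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Assumed ordering: Open Graph, then Twitter Card, then the <title> element.
$title = firstMatch($xpath, [
    '//meta[@property="og:title"]/@content',
    '//meta[@name="twitter:title"]/@content',
    '//title',
]);

$published = firstMatch($xpath, [
    '//meta[@property="article:published_time"]/@content',
    '//time/@datetime',
]);

$language = firstMatch($xpath, [
    '/html/@lang',
    '//meta[@property="og:locale"]/@content',
]);

echo $title, "\n";     // Article Title
echo $published, "\n"; // 2025-01-15T10:00:00+00:00
echo $language, "\n";  // en-US
```

Here the Open Graph title wins over `<title>`, so the " | Example Site" suffix never appears; when only `<title>` is available, a separate step would be needed to strip the site name.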

Content removed during extraction:
- Scripts, styles, navigation, headers, footers
- Ads, sidebars, comments, newsletter signups
- Hidden elements, small images
- Author bylines, breadcrumbs, read-time indicators, trailing boilerplate
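
To make the removal step concrete, here is a simplified sketch that strips a few of the element types listed above with `DOMXPath`. The function `stripClutter` and its selector list are assumptions for illustration only; the library's actual rules (ads, small images, bylines, trailing boilerplate) are more involved and are not reproduced here.

```php
<?php
// Hypothetical helper: remove a fixed set of clutter elements and return the body HTML.
function stripClutter(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // HTML5 tags such as <nav> trigger libxml warnings
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Illustrative selector list: scripts, styles, page chrome, hidden elements.
    $query = '//script | //style | //nav | //header | //footer | //aside | //*[@hidden]';

    // Copy matches into an array so removal does not interfere with iteration.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialise only the children of <body>, not the whole document wrapper.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return trim($out);
}

$page = '<html><body>'
      . '<nav><a href="/">Home</a></nav>'
      . '<div id="main"><p>Main text of the article.</p></div>'
      . '<div hidden>Tracking pixel</div>'
      . '<footer>Copyright notice</footer>'
      . '<script>track();</script>'
      . '</body></html>';

$cleaned = stripClutter($page);
echo $cleaned, "\n";
```

Tag-based removal like this is only a first pass; identifying ads, comment sections, or newsletter forms generally requires class-name heuristics or content scoring, which this sketch does not attempt.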
MIT — see LICENSE