deduplicate-articles

Feedly API Search with Enhanced Deduplication

This script searches the Feedly API and removes duplicate articles by combining several detection methods, so that each story appears only once in the results.

Problem Solved

When using Feedly's API with similar=true, some duplicate articles may still appear in results because:

Not all articles have a featuredMeme (cluster ID), so they can't be deduplicated by cluster
Articles from different sources may have very similar content

Solution

This script combines multiple deduplication methods:

Feedly's native duplicate detection - Uses the duplicates field when available
Cluster-based deduplication - Groups articles about the same story (when cluster exists)
Title similarity matching - Catches near-identical titles (85% similarity threshold)
Persistent tracking - Remembers previously seen articles across runs
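
As an illustration of the title similarity method, here is a minimal sketch built on difflib.SequenceMatcher. The function name titles_match and the lowercasing/whitespace normalisation are assumptions made for this example; the script's own helper may differ.

```python
from difflib import SequenceMatcher


def titles_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Return True when two titles are similar enough to count as duplicates."""
    # Normalise case and whitespace so trivial differences do not lower the ratio.
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    # ratio() returns a float in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, norm_a, norm_b).ratio() >= threshold
```

A lower threshold catches more near-duplicates but also raises the chance of merging unrelated stories with similar headlines.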

Setup

1. Install Dependencies

pip install -r requirements.txt

2. Configure Environment

Copy .env.sample to .env and update with your settings:

# Required: Your Feedly Enterprise API token
FEEDLY_API_TOKEN=your_token_here

# Optional: Customize these as needed
FEEDLY_QUERY_FILE=search_query.json       # Path to search query JSON
FEEDLY_OUTPUT_FILE=feedly_results.json    # Output filename
FEEDLY_OUTPUT_FORMAT=csv                  # Output format: json or csv
FEEDLY_SEARCH_DAYS=7                      # Days to look back
FEEDLY_SEARCH_COUNT=100                   # Articles per page
FEEDLY_MAX_PAGES=5                        # Maximum pages to fetch
FEEDLY_VERBOSE=false                      # Enable debug logging
FEEDLY_DB_FILE=feedly_seen_entries.csv    # Database for tracking seen articles
FEEDLY_DB_RETENTION_DAYS=30               # Days to remember articles
FEEDLY_DEDUP_BY_CLUSTER=true              # Enable cluster deduplication
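
The sketch below shows one way these variables could be read and validated in Python. It is not the script's actual loader: the Settings class, the load_settings function, and the fallback defaults (copied from the sample values above) are assumptions, and it expects the variables to already be in the process environment (for example, exported by your shell or loaded by a .env helper).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Settings:
    api_token: str
    output_format: str
    search_days: int
    search_count: int
    max_pages: int
    verbose: bool
    dedup_by_cluster: bool


def _as_bool(value: str) -> bool:
    # Accept the common spellings of "true" used in .env files.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    env = os.environ if env is None else env
    token = env.get("FEEDLY_API_TOKEN", "")
    if not token:
        raise ValueError("FEEDLY_API_TOKEN is required")
    fmt = env.get("FEEDLY_OUTPUT_FORMAT", "json").lower()
    if fmt not in {"json", "csv"}:
        raise ValueError("FEEDLY_OUTPUT_FORMAT must be 'json' or 'csv'")
    # Defaults below mirror the sample file; they are assumptions, not documented defaults.
    return Settings(
        api_token=token,
        output_format=fmt,
        search_days=int(env.get("FEEDLY_SEARCH_DAYS", "7")),
        search_count=int(env.get("FEEDLY_SEARCH_COUNT", "100")),
        max_pages=int(env.get("FEEDLY_MAX_PAGES", "5")),
        verbose=_as_bool(env.get("FEEDLY_VERBOSE", "false")),
        dedup_by_cluster=_as_bool(env.get("FEEDLY_DEDUP_BY_CLUSTER", "true")),
    )
```

Passing a plain dict instead of os.environ makes the loader easy to exercise without touching the real environment.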

3. Configure Search Query

Edit search_query.json to define your search. Example for threat intelligence:

{
    "layers": [
        {
            "parts": [
                {
                    "id": "nlp/f/entity/gz:ta:68391641-859f-4a9a-9a1e-3e5cf71ec376",
                    "label": "Lazarus Group",
                    "type": "threatActor"
                }
            ],
            "type": "matches",
            "salience": "about"
        }
    ],
    "source": {
        "items": [
            {
                "type": "publicationBucket",
                "id": "byf:cybersecurity-bundle",
                "tier": "tier1",
                "label": "Cybersecurity Bundle"
            }
        ]
    }
}
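
Before sending a query, it can help to check that the file parses and has the shape shown above. The following is a hedged sketch: validate_query and load_query are hypothetical helpers, and the checks cover only the keys visible in this example, not the full set of rules the Feedly API may enforce.

```python
import json


def validate_query(query: dict) -> dict:
    """Check the minimal structure used in the example query and return it."""
    layers = query.get("layers")
    if not isinstance(layers, list) or not layers:
        raise ValueError("query needs a non-empty 'layers' list")
    for index, layer in enumerate(layers):
        parts = layer.get("parts")
        if not isinstance(parts, list) or not parts:
            raise ValueError(f"layer {index} needs a non-empty 'parts' list")
        for part in parts:
            if "id" not in part:
                raise ValueError(f"layer {index} has a part without an 'id'")
    return query


def load_query(path: str = "search_query.json") -> dict:
    # json.load raises json.JSONDecodeError on malformed files.
    with open(path, encoding="utf-8") as handle:
        return validate_query(json.load(handle))
```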

Usage

Simply run the script:

python feedly_search.py

Output

The script provides detailed statistics showing how duplicates were removed:

==================================================
DEDUPLICATION STATISTICS:
  Total articles retrieved: 100
  Unique articles: 78
  Duplicates removed: 22
  Story clusters found: 27

BREAKDOWN BY METHOD:
  Removed by Feedly duplicates field: 0
  Removed by title similarity: 5
  Removed by cluster deduplication: 17
  Removed by previous database entries: 0
  Articles without clusters: 51
  Duplicate groups found (Feedly): 0
==================================================
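
The figures in this report are related by simple arithmetic: the per-method removals sum to the total removed (0 + 5 + 17 + 0 = 22), and unique articles equal the total minus removals (100 - 22 = 78). A minimal sketch of such a tally, assuming a hypothetical summarize helper and method labels chosen for illustration:

```python
from collections import Counter


def summarize(total: int, removed_by: Counter) -> dict:
    """Derive the headline numbers from a per-method removal tally."""
    removed = sum(removed_by.values())
    return {
        "total": total,
        "unique": total - removed,
        "removed": removed,
        "by_method": dict(removed_by),
    }
```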

Output Files

CSV format: Includes columns for id, title, published date, URL, cluster info, and more
JSON format: Full article data with deduplication metadata
Database file: Tracks seen articles to prevent duplicates across runs
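
For orientation, here is a rough sketch of how the two output formats could be produced with the standard library. The column names in CSV_COLUMNS and both function names are assumptions for this example; the script's real column set ("and more") is not listed here.

```python
import csv
import io
import json

# Assumed column names; the real script may use different or additional ones.
CSV_COLUMNS = ["id", "title", "published", "url", "cluster_id"]


def to_csv_text(articles):
    """Render articles as CSV, ignoring keys that are not in CSV_COLUMNS."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=CSV_COLUMNS, extrasaction="ignore", restval=""
    )
    writer.writeheader()
    writer.writerows(articles)
    return buffer.getvalue()


def to_json_text(articles):
    """Render the full article dicts, including any deduplication metadata."""
    return json.dumps(articles, indent=2, ensure_ascii=False)
```

CSV keeps only the selected columns, while JSON preserves every field, which matches the trade-off described above.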

How It Works

Fetches articles from the Feedly API with the similar=true parameter
Builds a duplicate graph from Feedly's duplicates field
Checks title similarity using an 85% threshold (adjustable in the script)
Groups articles by cluster when a cluster ID is available
Tracks seen articles in a persistent database
Outputs deduplicated results with detailed statistics
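
The steps above can be sketched as a single pass over the fetched articles. This is a simplified illustration, not the script's code: the deduplicate function, the cluster_id key, and the order of the checks (previously seen, then cluster, then title) are all assumptions, and handling of Feedly's duplicates field is left out for brevity.

```python
from difflib import SequenceMatcher


def deduplicate(articles, seen_ids=frozenset(), threshold=0.85):
    """Return (kept_articles, reasons) where reasons maps removed id -> method."""
    kept, kept_clusters, reasons = [], set(), {}
    for article in articles:
        article_id = article["id"]
        cluster = article.get("cluster_id")  # assumed key; may be absent
        title = article.get("title", "").lower()
        if article_id in seen_ids:
            reasons[article_id] = "database"
        elif cluster is not None and cluster in kept_clusters:
            reasons[article_id] = "cluster"
        elif any(
            SequenceMatcher(None, title, other.get("title", "").lower()).ratio()
            >= threshold
            for other in kept
        ):
            reasons[article_id] = "title"
        else:
            kept.append(article)
            if cluster is not None:
                kept_clusters.add(cluster)
    return kept, reasons
```

Running the cheap checks (set lookups) before the pairwise title comparison keeps the quadratic title step limited to articles that survive the earlier filters.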

Key Features

Handles articles without clusters - Uses title similarity when cluster ID is missing
Persistent deduplication - Remembers articles across multiple runs
Configurable thresholds - Adjust similarity sensitivity as needed
Detailed logging - See exactly which method removed each duplicate
Multiple output formats - Choose JSON or CSV based on your needs

Troubleshooting

Still seeing duplicates?

Lower the title similarity threshold - Edit the script and change title_similarity_threshold=0.85 to a lower value like 0.75
Check if articles have different titles - Some duplicates may have slightly different titles from different sources
Enable verbose logging - Set FEEDLY_VERBOSE=true to see detailed deduplication decisions

Performance tips

Adjust page count - Reduce FEEDLY_MAX_PAGES if you don't need all results
Clear old database - Delete feedly_seen_entries.csv to start fresh
Tune retention - Adjust FEEDLY_DB_RETENTION_DAYS based on your needs
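
To make the retention setting concrete, the sketch below drops database rows older than the retention window. The row layout (an entry_id plus an ISO 8601 first_seen timestamp with a UTC offset) and the prune_seen name are assumptions; the actual columns of feedly_seen_entries.csv are not documented here.

```python
from datetime import datetime, timedelta, timezone


def prune_seen(rows, retention_days=30, now=None):
    """Keep only rows whose first_seen timestamp falls inside the retention window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    # Timestamps are assumed to be timezone-aware ISO 8601 strings.
    return [
        row for row in rows
        if datetime.fromisoformat(row["first_seen"]) >= cutoff
    ]
```

A shorter window keeps the database small but lets older stories reappear if they are republished after the window has passed.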

Technical Details

The script enhances deduplication by:

Building bidirectional duplicate relationships (if A→B, then B→A)
Using fuzzy string matching for titles (difflib.SequenceMatcher)
Maintaining a sliding window of seen articles (default 30 days)
Combining multiple signals for comprehensive deduplication
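
The bidirectional relationship idea can be illustrated with a small undirected graph. In this sketch the shape of the duplicates field (a list of objects with an id key) is an assumption, as are the function names; check the Feedly API response you receive before relying on it.

```python
from collections import defaultdict


def build_duplicate_graph(articles):
    """Record every duplicate link in both directions (A -> B implies B -> A)."""
    graph = defaultdict(set)
    for article in articles:
        for dup in article.get("duplicates") or []:
            graph[article["id"]].add(dup["id"])
            graph[dup["id"]].add(article["id"])
    return graph


def duplicate_groups(graph):
    """Collect connected components, so chains like A-B and B-C form one group."""
    visited, groups = set(), []
    for start in graph:
        if start in visited:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        visited |= group
        groups.append(group)
    return groups
```

Storing links in both directions means an article is recognised as a duplicate even when only the other article lists it.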

Support

If results look wrong or you have questions about the Feedly API, try the following first:

Check the deduplication statistics to understand what's happening
Enable verbose logging for detailed debugging
Adjust the title similarity threshold if needed
Consider that some duplicates may be intentional (different perspectives on the same story)
