
fix: preserve colspan and rowspan in HTML table cleaning #1987

Open
cgseyhan wants to merge 1 commit into unclecode:main from cgseyhan:fix/table-colspan-rowspan


@cgseyhan

What's the Issue?

During the HTML sanitization and cleaning phase, the table structure attributes colspan and rowspan were being stripped.

Why does this matter?
Without these attributes, a table with merged cells loses its structure: cells no longer line up with the columns and rows they belong to. This leads to malformed Markdown generation and data loss during scraping, making table extraction unreliable.
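To illustrate the misalignment, here is a minimal sketch that expands cells into a rectangular grid, once with their spans and once with the spans stripped. The `expand_grid` helper and the `(text, colspan, rowspan)` cell format are hypothetical and exist only for this example; they are not part of crawl4ai.

```python
def expand_grid(rows):
    """Expand rows of (text, colspan, rowspan) cells into a rectangular grid."""
    grid = {}  # maps (row, col) -> cell text
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Fill every slot the cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Header "Name" spans two rows, "Score" spans two columns.
with_spans = [
    [("Name", 1, 2), ("Score", 2, 1)],
    [("Math", 1, 1), ("Art", 1, 1)],
]
# The same table after colspan/rowspan have been stripped (every span is 1).
stripped = [[(text, 1, 1) for text, _, _ in row] for row in with_spans]

print(expand_grid(with_spans))  # [['Name', 'Score', 'Score'], ['Name', 'Math', 'Art']]
print(expand_grid(stripped))    # [['Name', 'Score'], ['Math', 'Art']]
```

In the stripped version, "Math" ends up under "Name" and the third column disappears, which is the kind of breakage that carries over into the generated Markdown.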


The Solution

To fix this, I updated the global sanitization rules to explicitly preserve table layout attributes.

Changes Made:

  • Added "colspan" and "rowspan" to the IMPORTANT_ATTRS list inside crawl4ai/config.py.
  • Ensured that the BeautifulSoup-based cleaner retains these crucial attributes for <td> and <th> elements during the DOM processing phase.
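The effect of the allowlist change can be sketched as follows. This is a simplified, standard-library illustration of allowlist-based attribute filtering, not the actual crawl4ai cleaner (which is BeautifulSoup-based). The contents of `IMPORTANT_ATTRS` shown here are an assumption apart from "colspan" and "rowspan", and `AttrFilter` and `strip_attrs` are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlist for illustration; the real list in crawl4ai/config.py
# may contain different entries besides "colspan" and "rowspan".
IMPORTANT_ATTRS = ["src", "href", "alt", "title", "colspan", "rowspan"]


class AttrFilter(HTMLParser):
    """Re-emit HTML, keeping only attributes whose names are in the allowlist."""

    def __init__(self, keep):
        super().__init__(convert_charrefs=True)
        self.keep = set(keep)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in self.keep
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def strip_attrs(html, keep=IMPORTANT_ATTRS):
    """Return html with every attribute outside `keep` removed."""
    parser = AttrFilter(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


cell = '<td colspan="2" class="wide" style="color:red">Total</td>'
print(strip_attrs(cell))                        # <td colspan="2">Total</td>
print(strip_attrs(cell, keep=["src", "href"]))  # <td>Total</td>
```

The second call mimics the behavior before the fix: with the span attributes absent from the allowlist, the cell loses its colspan.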

Testing & Verification

I added a regression test so this bug doesn't return:

  • New Unit Test Added: tests/test_html_table_colspan_rowspan.py
  • Simulated Crawl Test: The test uses AsyncWebCrawler with a raw: HTML string containing a complex table setup.
  • Assertion check: Verified that the final sanitized HTML retains the colspan="2" and rowspan="2" attributes.
  • Local Test Status: All tests are currently passing!
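As a rough idea of what the assertion step checks, the sketch below parses a cleaned HTML string and collects the span attributes found on table cells. It does not reproduce the test in this PR: `SAMPLE_CLEANED_HTML` is a hand-written stand-in for the crawler's sanitized output, and `SpanCollector` and `collect_spans` are hypothetical helpers built on the standard library.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Record colspan/rowspan values found on <td> and <th> start tags."""

    def __init__(self):
        super().__init__()
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            found = {k: v for k, v in attrs if k in ("colspan", "rowspan")}
            if found:
                self.spans.append((tag, found))


def collect_spans(html):
    """Return a list of (tag, {attr: value}) for cells that carry span attributes."""
    collector = SpanCollector()
    collector.feed(html)
    collector.close()
    return collector.spans


# Stand-in for the sanitized HTML a crawl would return (assumption for this sketch).
SAMPLE_CLEANED_HTML = """
<table>
  <tr><th rowspan="2">Name</th><th colspan="2">Score</th></tr>
  <tr><th>Math</th><th>Art</th></tr>
  <tr><td>Ada</td><td>90</td><td>85</td></tr>
</table>
"""

spans = collect_spans(SAMPLE_CLEANED_HTML)
assert ("th", {"rowspan": "2"}) in spans
assert ("th", {"colspan": "2"}) in spans
```

Checking parsed attributes rather than raw substrings keeps the assertion independent of attribute order and quoting in the sanitized output.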

This commit adds colspan and rowspan to IMPORTANT_ATTRS in config.py so that these table structure attributes are preserved during sanitization. Also added a unit test.