
esp32/ota: Implement ESP-IDF OTA functionality. #7048

Open
ekondayan wants to merge 1 commit into micropython:master from ekondayan:feature/ota

@ekondayan ekondayan commented Mar 17, 2021

UPDATE: The test completed successfully on NodeMCU ESP32 by ai-thinker with ESP-IDF v4.0, v4.1, v4.2, v4.3

Implemented new functions in esp32.Partition:

  • mark_app_invalid_rollback_and_reboot()

  • check_rollback_is_possible()

  • app_description()

  • app_state()

  • ota_begin()

  • ota_write()

  • ota_write_with_offset() for ESP-IDF version >= 4.2

  • ota_end()

  • ota_abort() for ESP-IDF version >= 4.3

  • create tests

  • update documentation

For many commercial products, over-the-air (OTA) updates are a critical feature. They must be reliable and must never brick the device. Writing a good OTA implementation from scratch is a daunting task, so well-tested, proven libraries are preferable to ones developed in-house.

USECASE

I'm developing an industrial device in which OTA is essential. Because only a few functions from esp-idf are implemented (enough for a hobby project but not for a commercial one), I ended up duplicating the esp-idf OTA functionality in Python.
The result was a module of questionable quality. I tried to predict all the places where it could crash, but my gut told me I could be missing something. So I decided to implement more of the OTA functions from esp-idf and use them in my OTA module.
I've rewritten my OTA module and replaced the redundant code with the implemented functions from esp-idf. This allowed me to reduce the size of the module significantly and increase the robustness of the code; on top of that, the code is now much simpler and easier to maintain.
The total increase in size of the compiled app image is 2432 bytes.

IMPLEMENTATION

Extend the esp32.Partition class where all the OTA related functions are
prefixed with "ota_" and app related functions are prefixed with "app_".

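Example:
from esp32 import Partition

# Inspect the currently running app.
app_part = Partition(Partition.RUNNING)
app_part.app_description()
app_part.app_state()

# Write the update to the next OTA slot; ESP-IDF refuses to begin an OTA on the running partition.
ota_part = app_part.get_next_update()
handle = ota_part.ota_begin()
ota_part.ota_end(handle)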

New functions:
Partition.mark_app_invalid_rollback_and_reboot(cls)
Partition.check_rollback_is_possible(cls)
Partition.app_description(self)
Partition.app_state(self)
Partition.ota_begin(self, image_size = 0)
Partition.ota_write(self, handle_in, data_in)
Partition.ota_write_with_offset(self, handle_in, data_in, offset)
Partition.ota_end(self, handle_in)
Partition.ota_abort(self, handle_in) only for ESP-IDF version >= 4.3
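
As a rough illustration of how the functions listed above might fit together, the sketch below streams an image into a partition and aborts if a write fails. It assumes the method signatures proposed in this PR (`ota_begin`, `ota_write`, `ota_end`, `ota_abort`); `apply_update` and `_FakePartition` are hypothetical names, and the fake exists only so the sketch runs off-device.

```python
def apply_update(part, chunks, image_size=0):
    """Stream `chunks` (an iterable of bytes) into `part` using the ota_*
    methods proposed in this PR. Returns the number of bytes written."""
    handle = part.ota_begin(image_size)
    written = 0
    try:
        for chunk in chunks:
            part.ota_write(handle, chunk)
            written += len(chunk)
    except Exception:
        # ota_abort() is only proposed for ESP-IDF >= 4.3, so probe for it.
        if hasattr(part, "ota_abort"):
            part.ota_abort(handle)
        raise
    # Per the PR discussion, ota_end() is where the image gets validated.
    part.ota_end(handle)
    return written


class _FakePartition:
    """Off-device stand-in that only records calls; not part of MicroPython."""

    def __init__(self):
        self.data = b""
        self.ended = False
        self.aborted = False

    def ota_begin(self, image_size=0):
        return 1  # opaque handle

    def ota_write(self, handle, data):
        self.data += data

    def ota_end(self, handle):
        self.ended = True

    def ota_abort(self, handle):
        self.aborted = True


fake = _FakePartition()
total = apply_update(fake, [b"\xe9abc", b"defg"])
print(total, fake.ended)
```

On a device, `part` would presumably be the partition returned by `Partition(Partition.RUNNING).get_next_update()`, and a successful `apply_update()` would be followed by `part.set_boot()` and a reset.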

BENEFITS

  • no code duplication
  • reduced code size
  • cleaner and easier to maintain code
  • esp-idf handles encrypted flash
  • use a reliable and well tested library written by the creators of ESP32
  • more robust OTA procedure
  • better performance

CONS

  • None :) (maybe the 2432 bytes overhead in the final compiled app image)

@ekondayan
Author

ekondayan commented Apr 30, 2021

Anybody?
Is there any chance of this getting merged into mainline?

@ekondayan ekondayan force-pushed the feature/ota branch 2 times, most recently from e98f531 to a9ba869 Compare May 6, 2021 16:51
@ekondayan ekondayan force-pushed the feature/ota branch 2 times, most recently from 23025f2 to 5fc57a9 Compare May 14, 2021 17:25
@ekondayan ekondayan force-pushed the feature/ota branch 2 times, most recently from e353cfd to cdd760c Compare June 14, 2021 09:15
@ekondayan ekondayan force-pushed the feature/ota branch 2 times, most recently from baa1249 to 8d0268c Compare June 27, 2021 08:50
@ekondayan ekondayan force-pushed the feature/ota branch 3 times, most recently from 3ce6443 to bb49a3f Compare July 22, 2021 05:07
@ekondayan ekondayan force-pushed the feature/ota branch 2 times, most recently from ffbdab7 to e4d981c Compare September 16, 2021 07:14
@abraha2d

abraha2d commented Dec 18, 2022

@ekondayan Thanks for keeping this up to date, I'm using this successfully in my project. Hopefully it can get mainlined soon.

@ekondayan
Author

@abraha2d Thank you for your positive feedback. I am glad that it is useful to you.

@dpgeorge I've been keeping this PR up to date for almost 2 years. Is there any chance of this PR going mainline?

@ekondayan ekondayan force-pushed the feature/ota branch 3 times, most recently from cb5b5d3 to 7f51f67 Compare May 22, 2023 06:22
@codecov

codecov Bot commented Jul 12, 2023

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.41%. Comparing base (be15be3) to head (1f86791).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7048   +/-   ##
=======================================
  Coverage   98.41%   98.41%           
=======================================
  Files         171      171           
  Lines       22324    22324           
=======================================
  Hits        21971    21971           
  Misses        353      353           


@glenn20
Contributor

glenn20 commented Oct 8, 2023

Another requirement is to enable app rollback in the sdkconfig file

CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE=y

This is now enabled in the v1.21.0 release. See #12475 (sorry - I missed this PR when I was preparing that PR).

Note - if you are interested in OTA, I also have a micropython OTA tool at https://github.com/glenn20/micropython-esp32-ota which uses the existing support for OTA in micropython. There is also a separate tool at https://github.com/glenn20/mp-image-tool-esp32 which can (among other things) add OTA partition tables to micropython firmware images and flash storage on ESP32 devices.

@projectgus
Contributor

This is an automated heads-up that we've just merged a Pull Request
that removes the STATIC macro from MicroPython's C API.

See #13763

A search suggests this PR might apply the STATIC macro to some C code. If it
does, then next time you rebase the PR (or merge from master) then you should
please replace all the STATIC keywords with static.

Although this is an automated message, feel free to @-reply to me directly if
you have any questions about this.

@github-actions

github-actions Bot commented Jan 20, 2026

Code size report:

Reference:  docs/mimxrt: Add docs for mimxrt.Flash. [be15be3]
Comparison: esp32/ota: Implement ESP-IDF OTA functionality. [merge of 1f86791]
  mpy-cross:    +0 +0.000% 
   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
      esp32: +5008 +0.287% ESP32_GENERIC[incl +1792(data) +8(bss)]
     mimxrt:    +0 +0.000% TEENSY40
        rp2:    +0 +0.000% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

Contributor

@projectgus projectgus left a comment

Thanks for keeping this PR up to date for so long, @ekondayan.

The state of OTA support in MicroPython has probably changed since you first opened the PR, but if I follow correctly, MicroPython has already implemented OTA functionality for ESP32, and this PR adds three related areas of functionality:

  1. New APIs for reading app information from the partition.
  2. Additional rollback support for marking an update invalid or testing if rollback is possible.
  3. A wrapper around the ESP-IDF OTA write API rather than writing to the partition directly.

Part 1 makes sense to me as useful functionality for managing OTA updates in a deployed application. You could probably put these into a separate PR and we could merge them pretty quickly.

Part 2 I have some inline questions about, but the check_rollback_is_possible() function definitely seems useful to avoid reboot loops.

Part 3, I'm not certain what the exact benefits of an ota_* API are here compared to manually writing the partition and then calling part.set_boot() to set it as the next boot partition.

Is the main difference that the OTA API will verify the image? If so, could we add this as an optional boolean verify argument on set_boot() and get the same functionality? If there are any other benefits, can you please explain them?

Finally, the code size report suggests this PR adds 1.6KB of static RAM usage (+1632(data) +8(bss)). I'm not sure how or why, since there's no static buffer that I can see, but that's a big impact for ESP32s without PSRAM, so it'd be good if there was a way to avoid it.

Comment thread docs/library/esp32.rst Outdated

.. method:: Partition.app_state()

Returns the app state of a valid ota partition. It can be one of the following strings:
Contributor

The more common pattern in MicroPython would be to return an integer from this function, and define these as constants on the Partition class - i.e. Partition.APP_STATE_NEW, Partition.APP_STATE_VERIFY, etc.

Author

You're right, that aligns better with MicroPython conventions. I'll refactor app_state() to return the integer value directly and add the corresponding constants to the Partition class:

Partition.APP_STATE_NEW
Partition.APP_STATE_PENDING_VERIFY  
Partition.APP_STATE_VALID
Partition.APP_STATE_INVALID
Partition.APP_STATE_ABORTED
Partition.APP_STATE_UNDEFINED
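
To illustrate how integer states and class constants might be used together, here is a minimal sketch. The numeric values are copied from ESP-IDF's `esp_ota_img_states_t` enum; whether this PR exposes exactly these values as `Partition.APP_STATE_*` is an assumption, and the helper names `needs_confirmation` and `is_bootable` are hypothetical.

```python
# Values copied from ESP-IDF's esp_ota_img_states_t. On a device you would
# read them from Partition.APP_STATE_* instead (names assumed from this PR).
APP_STATE_NEW = 0x0
APP_STATE_PENDING_VERIFY = 0x1
APP_STATE_VALID = 0x2
APP_STATE_INVALID = 0x3
APP_STATE_ABORTED = 0x4
APP_STATE_UNDEFINED = 0xFFFFFFFF


def needs_confirmation(state):
    """True while the bootloader is still waiting for the app to confirm itself."""
    return state == APP_STATE_PENDING_VERIFY


def is_bootable(state):
    """False for INVALID and ABORTED, the two states that ESP-IDF's rollback
    documentation says the bootloader will not select."""
    return state not in (APP_STATE_INVALID, APP_STATE_ABORTED)


print(needs_confirmation(APP_STATE_PENDING_VERIFY), is_bootable(APP_STATE_ABORTED))
```

Comparing against named constants rather than strings keeps the check cheap and avoids typos in state names going unnoticed.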

Comment thread docs/library/esp32.rst
If the "CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE" option is set, and a reset occurs without
calling either
``mark_app_valid_cancel_rollback()`` or ``mark_app_invalid_rollback_and_reboot()``
function then the application is rolled back.
Contributor

Is there a difference between calling this API, versus simply restarting without calling mark_app_valid_cancel_rollback()? If we can get the same behaviour without adding a new function here then it might be better to leave it out?

Author

Yes, there is a meaningful difference. When you simply restart without calling mark_app_valid_cancel_rollback(), the rollback only happens on the next boot (the bootloader detects the "pending_verify" state persisted across two boots and then changes it to "aborted") - so it requires two boot cycles.
In contrast, mark_app_invalid_rollback_and_reboot() performs an immediate rollback in one boot cycle - it actively marks the current partition as invalid in otadata and reboots to the previous working partition in one atomic operation. This matters in two cases:

  1. Post-validation failures: If the app already called mark_app_valid_cancel_rollback() early in boot and later detects a critical issue (can't reach server, hardware fails under load, config corruption), passive rollback is no longer possible - the partition is already marked valid.
  2. Atomicity: A manual approach risks power loss between operations, leaving undefined state. The ESP-IDF function writes otadata and reboots as a single coordinated operation.

@ekondayan
Author

The benefit of the ota_* functions is that they handle efuse values, checksums, encryption, etc., and they follow best practices for manipulating the OTA partition. These are mission-critical functions, and for production devices I wouldn't trust myself to reimplement them in MicroPython.

TL;DR
ESP-IDF provides fast, small and reliable functions that cover all edge cases and follow best practices for ESP32 devices.

  1. Image validation: ota_end() validates the complete image (magic bytes, checksums). With direct writes, you'd need to implement validation yourself.
  2. Secure boot signature verification: If secure boot is enabled, ota_end() verifies the firmware signature. This is non-trivial to replicate manually.
  3. Flash encryption handling: The OTA API handles encrypted flash transparently. Direct writes to an encrypted partition require manual encryption handling.
  4. Rollback state initialization: The OTA API properly sets the new partition's state to "new" in otadata, which triggers the rollback state machine on next boot. Direct writes + set_boot() might not initialize this correctly.
  5. Anti-rollback/secure version: If anti-rollback is enabled, the OTA API validates that the new firmware's secure_version >= efuse value.
  6. Out-of-order writes: ota_write_with_offset() allows non-contiguous writes when network packets arrive out of order - useful for UDP-based OTA.
  7. Clean abort: ota_abort() properly cleans up state if the update fails mid-way.
  8. verify on set_boot() isn't sufficient: Validation needs to happen after all writes complete but before marking bootable. The handle-based API (ota_begin → ota_write → ota_end) provides this atomic commit-or-abort semantic. By the time you call set_boot(), you've already written potentially invalid data.

Add OTA update support to the Partition class, including:
 - ESP-IDF OTA API wrappers.
 - app metadata inspection
 - rollback management

Implemented new functions:
* mark_app_invalid_rollback_and_reboot()
* check_rollback_is_possible()
* app_description()
* app_state() returns integer, use APP_STATE_* constants
* ota_begin()
* ota_write()
* ota_write_with_offset() for ESP-IDF version >= 4.2
* ota_end()
* ota_abort() for ESP-IDF version >= 4.3

Added APP_STATE_* constants to Partition class:
* APP_STATE_NEW
* APP_STATE_PENDING_VERIFY
* APP_STATE_VALID
* APP_STATE_INVALID
* APP_STATE_ABORTED
* APP_STATE_UNDEFINED

* create tests
* update documentation
@projectgus
Contributor

Hi @ekondayan,

Thanks for getting back to me.

> • Image validation: ota_end() validates the complete image (magic bytes, checksums). With direct writes, you'd need to implement validation yourself.
> • Secure boot signature verification: If secure boot is enabled, ota_end() verifies the firmware signature. This is non-trivial to replicate manually.

There's a function esp_image_verify() which we can call directly (it's what esp_ota_end() is calling internally anyhow).

> • Flash encryption handling: The OTA API handles encrypted flash transparently. Direct writes to an encrypted partition require manual encryption handling.

The ESP-IDF OTA API is calling the same esp_partition_write() functions that MicroPython already calls, so this part is the same.


Before I keep answering these points, I'm sorry but I have to ask an important question: did you use generative AI (like ChatGPT) to respond to my review? I don't mean to be rude, but many of the answers have the "confident and mostly but not totally correct" style of something written by an LLM. I need to know that you're putting in time and effort into creating these comments, before I put more of my own time and effort into responding to them.

@ekondayan
Author

> Before I keep answering these points, I'm sorry but I have to ask an important question: did you use generative AI (like ChatGPT) to respond to my review? I don't mean to be rude, but many of the answers have the "confident and mostly but not totally correct" style of something written by an LLM. I need to know that you're putting in time and effort into creating these comments, before I put more of my own time and effort into responding to them.

Of course I use an LLM. It would be weird if I didn't. You are not rude at all to question this. I myself avoid investing effort and energy unless I see a reasonable and thinking AI or NI (Natural Intelligence) on the other end. I don't use an LLM to generate my responses, just to review and polish them. The technical reasoning is mine.

> There's a function esp_image_verify() which we can call directly (it's what esp_ota_end() is calling internally anyhow).

You're right that esp_image_verify() exists and the same write functions are used internally but my concern is about the complete workflow.

I see where this leads: you want to save a few bytes and rely on developers to implement the correct sequence. My argument is this: for OTA-enabled devices, this is one of the most important features, because if it is not implemented correctly, you can render the device useless and end its life prematurely. Imagine selling or deploying thousands of devices that can be bricked easily by a tiny human mistake. It depends on who you expect to use MicroPython. If you are targeting DIY hobbyists who primarily use it for watering their plants, you can take the risk and rely on developers to implement the correct workflow manually. But if you want to ship commercial devices that sit in factories where downtime is unacceptable, you cannot rely solely on devs to implement it correctly. I wouldn't trust even myself for that. It is too risky to deviate from the official, battle-tested workflow recommended by Espressif.

I can make this feature optional via a flag in mpconfigport.h. Users who need the full OTA workflow can just enable it.

#ifndef MICROPY_PY_ESP32_PARTITION_OTA_EXTENDED
#define MICROPY_PY_ESP32_PARTITION_OTA_EXTENDED (1)
#endif

What do you think?

TL;DR
Manually calling write -> verify -> set_boot is error-prone and requires reading the IDF documentation carefully. The ESP-IDF OTA API, on the other hand, enforces the correct sequence and state management.

P.S.
This comment is 100% human generated - though I'm curious to see if my writing style has been influenced by reading too much LLM output over the last few years.

@projectgus
Contributor

> for OTA enabled devices, this is one of the most important features,

We agree about this point, and certainly I'd be happy to see robustness improvements to the esp32 OTA system in MicroPython (not least because I know for a fact it is used in some commercial deployments now).

The question is about how to achieve them. As MicroPython maintainers we have to balance a number of factors in order to find the best outcome. It's not simple from our perspective.

> you can not rely solely on devs to implement it correctly. I wouldn't trust even myself for that. It is too risky to deviate from the official battle tested workflow recommended by Espressif.

This is true even if we wrap the OTA API from Espressif, and is why we have tests/ports/esp32/partition_ota.py.

> I can make this feature optional via a flag in mpconfigport.h. Users who need the full OTA workflow can just enable it.
> What do you think?

This isn't a great solution unless we can enable it on most boards, because disabled-by-default features are hard to test and tend to bitrot. It'd also leave us in the situation of having an existing, documented, less robust OTA system (the current one) and a better one gated behind a compile flag. This itself is error-prone for developers.

BTW, if the 1.6KiB static memory usage I noted is unavoidable then we would have to leave it disabled on most boards. Do you have any idea where that comes from?

> Manually calling write -> verify -> set_boot is error-prone and requires reading the IDF documentation carefully

Not sure I agree. If we changed the set_boot API call to be set_boot(verify=True) then the default can easily do the same checks as ota_end(), and this would work even for existing MicroPython OTA update code which is a big win for existing users.

As far as I can see, the difference of "reading the IDF documentation carefully" comes down to manually erasing the region and then calling the partition write function which isn't particularly complex (and we could wrap this in a micropython-lib module to appear like a file object as well, if we had to).

Am I missing something else?
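
For comparison, the manual path described in this comment (erase, write, then `set_boot()`) could look roughly like the sketch below. It relies on the block-device protocol that `esp32.Partition` already documents (`ioctl` op 5 for block size, op 6 for block erase, `writeblocks` with an offset). `write_image` and `_FakePartition` are hypothetical names, and `set_boot(verify=True)` is only a proposal in this thread, so it is not used here.

```python
def write_image(part, image):
    """Erase and write `image` (bytes) block by block, then mark `part`
    as the next boot partition. Assumes `part` follows the block-device
    protocol that esp32.Partition documents (ioctl ops 5 and 6)."""
    block_size = part.ioctl(5, 0)  # op 5: block size in bytes
    view = memoryview(image)
    for block, start in enumerate(range(0, len(image), block_size)):
        part.ioctl(6, block)  # op 6: erase one block
        # Passing an offset selects the extended protocol, which does not
        # erase on its own; that is why the explicit erase comes first.
        part.writeblocks(block, view[start:start + block_size], 0)
    part.set_boot()


class _FakePartition:
    """Off-device stand-in backed by a bytearray; not part of MicroPython."""

    BLOCK = 4096

    def __init__(self, blocks=4):
        self.flash = bytearray(b"\x00" * (self.BLOCK * blocks))
        self.boot = False

    def ioctl(self, op, arg):
        if op == 5:
            return self.BLOCK
        if op == 6:
            start = arg * self.BLOCK
            self.flash[start:start + self.BLOCK] = b"\xff" * self.BLOCK
            return 0

    def writeblocks(self, block, buf, offset=0):
        start = block * self.BLOCK + offset
        self.flash[start:start + len(buf)] = bytes(buf)

    def set_boot(self):
        self.boot = True


image = bytes(range(256)) * 20  # 5120 bytes, spans two blocks
fake = _FakePartition()
write_image(fake, image)
print(len(image), fake.boot)
```

On a device, `part` would presumably come from `Partition(Partition.RUNNING).get_next_update()`. Note that this sketch performs no image verification at all, which is the gap the `verify` argument proposed above would be meant to close.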


10 participants