fix(out): reconfigure stdout to UTF-8 on any non-UTF-8 terminal #1975
Open
bearomorphism wants to merge 3 commits into commitizen-tools:master
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #1975 +/- ##
=======================================
Coverage 98.23% 98.24%
=======================================
Files 61 61
Lines 2779 2785 +6
=======================================
+ Hits 2730 2736 +6
Misses 49 49
Pull request overview
This PR addresses crashes caused by UnicodeEncodeError when Commitizen prints Unicode characters to a non‑UTF‑8 sys.stdout (e.g., ISO‑8859‑1 locales), by normalizing stdout to UTF‑8 early and adding targeted regression tests.
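The failure mode can be reproduced without touching the locale. The sketch below is an illustration, not code from the PR: it wraps an in-memory buffer in a `TextIOWrapper` with an ISO‑8859‑1 encoding as a stand-in for a legacy terminal stream, and `try_write` is a hypothetical helper name introduced only for this example.

```python
import io

# Stand-in for a terminal stream whose locale encoding is ISO-8859-1.
latin1_stdout = io.TextIOWrapper(io.BytesIO(), encoding="iso-8859-1")


def try_write(stream, text):
    """Return None on success, or the UnicodeEncodeError that was raised."""
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return exc
    return None


# Characters inside the Latin-1 repertoire encode without trouble.
ok = try_write(latin1_stdout, "caf\u00e9\n")

# The rocket emoji and the typographic apostrophe are outside Latin-1,
# so the default errors="strict" handler raises.
rocket_error = try_write(latin1_stdout, "Configuration complete \U0001f680\n")
apostrophe_error = try_write(latin1_stdout, "don\u2019t\n")
```

With a real non‑UTF‑8 `sys.stdout`, the same exception would propagate out of `print()` and end the command.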
Changes:
- Replace the Windows-only stdout reconfiguration with a platform-agnostic `_ensure_utf8_stdout()` helper.
- Invoke `_ensure_utf8_stdout(sys.stdout)` at `commitizen.out` import time so all subsequent output uses the normalized stream.
- Add a new `tests/test_out.py` suite covering UTF‑8 no-op behavior and reconfiguration for common non‑UTF‑8 encodings.
Reviewed changes
Copilot reviewed 2 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `commitizen/out.py` | Introduces `_ensure_utf8_stdout()` and calls it at import time to avoid `UnicodeEncodeError` on non‑UTF‑8 stdout. |
| `tests/test_out.py` | Adds regression tests validating stdout reconfiguration logic across encodings and stream types. |
* fix docstring: errors=replace handles un-encodable inputs (lone surrogates etc.), not 'terminal cannot render', which is an OS concern, not the encoder's
* tighten test comment to describe what's actually being verified

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
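The distinction in this commit message can be checked directly. The following is a minimal sketch, assuming only the standard-library `io` module: a UTF-8 stream opened with `errors="replace"` encodes the rocket emoji normally, and only an input that UTF-8 itself cannot encode (here a lone surrogate) is replaced by `?`.

```python
import io

buffer = io.BytesIO()
# A UTF-8 text stream configured with errors="replace", as the PR describes.
stream = io.TextIOWrapper(buffer, encoding="utf-8", errors="replace")

# "\ud800" is a lone surrogate: a str code point that UTF-8 cannot encode.
stream.write("ok \U0001f680 bad \ud800 end")
stream.flush()

written = buffer.getvalue()

# For contrast, the default strict handler rejects the same input.
try:
    "\ud800".encode("utf-8")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
```

Whether a terminal can display the resulting bytes is outside the encoder's control, which is the point the docstring fix makes.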
Description
Closes #956.
Why
Commitizen emits Unicode characters that fall outside the Latin-1 repertoire, most visibly the rocket emoji (`\U0001f680`) in `out.success("Configuration complete 🚀")` and the typographic apostrophe (`\u2019`) scattered throughout the conventional-commits info text. When Python's `print()` encodes those characters through `sys.stdout`, it uses whatever encoding the operating system assigned to the terminal. On a system whose `LANG` is set to `de_CH.ISO8859-1`, that encoding is Latin-1, which cannot represent either character, so Python raises `UnicodeEncodeError` and crashes the command mid-output. Any other non-UTF-8 locale fails the same way for characters its encoding lacks.

The original guard in `commitizen/out.py:7-9` only reconfigures `sys.stdout` to UTF-8 when `sys.platform == "win32"`. Linux and macOS users on legacy ISO-8859-1 locales received no such reconfiguration, so every `print()` call that contained an out-of-range character crashed. The reporter ran AlmaLinux 8.9 and macOS Ventura 13.6.3, both with `LANG=de_CH.ISO8859-1`. Maintainer @Lee-W acknowledged the reproduction in the issue thread; a verification comment on the triage audit (#1964) confirmed the code path is unchanged in master (v4.15.1).

The fix extracts the reconfiguration into a small `_ensure_utf8_stdout()` helper that is encoding-agnostic and platform-agnostic. It is called once at module-import time so that every subsequent `print()` in every output helper (`write`, `line`, `success`, `info`, `warn`) is already operating on a UTF-8 stream before the first byte is written.

What changed
| File | Change |
|---|---|
| `commitizen/out.py` | Remove the `sys.platform == "win32"` guard (lines 7–9); add `_ensure_utf8_stdout()` helper; call it unconditionally on `sys.stdout` at module import |
| `tests/test_out.py` | New tests: UTF-8 no-op (including the `UTF8` spelling), ISO-8859-1 reconfigure, cp1252 reconfigure, non-`TextIOWrapper` skip, and end-to-end emoji write-through |

How it works
Encoding normalisation before comparison. The helper computes `(stream.encoding or "").lower().replace("-", "").replace("_", "")` before comparing against `"utf8"`. This collapses the spelling variants (`UTF-8`, `utf_8`, `UTF8`) into a single canonical string, so the no-op path is taken for all of them without maintaining an allowlist.

`errors="replace"` as a safety net. Even after reconfiguration, input that UTF-8 cannot encode (for example a lone surrogate) is written as a `?` placeholder rather than raising a second `UnicodeEncodeError`. Whether the terminal can render the resulting bytes is an OS concern, not the encoder's. Crashing mid-output after a user has already answered several `cz init` prompts is a worse experience than a one-character substitution.

Module-level call, not per-`print` wrapping. `_ensure_utf8_stdout(sys.stdout)` executes once when `commitizen.out` is first imported. Because every CLI command imports `out` before writing anything, the reconfiguration is in place for the full lifetime of the process, and no per-function guard is needed.

`isinstance(stream, io.TextIOWrapper)` guard. `io.StringIO` instances and other stand-in objects that are not `TextIOWrapper` subclasses do not expose `reconfigure()`. Checking the type before calling keeps the helper safe in test environments and in any context where `sys.stdout` has been replaced by a non-standard object.

`AttributeError`/`ValueError` are swallowed silently. A stream whose underlying buffer is already closed raises `ValueError`; a stream backed by a read-only file descriptor raises `AttributeError`. Neither case should crash the import. The `# pragma: no cover` annotation marks these as genuine safety-net branches rather than untested logic.

Backward compatibility
- Windows streams that report `cp1252` (the historical Windows case) are non-UTF-8, so the reconfiguration that previously lived behind the `win32` guard still happens.
- Users with a UTF-8 `sys.stdout` (the common case on modern Linux/macOS) hit the early-return branch: no `reconfigure()` call, no observable difference.
- `tests/test_out.py` is a new file, so there are no test regressions.

Checklist
Was generative AI tooling used to co-author this PR?
Generated-by: Claude following the guidelines
Code Changes
- Run `uv run poe all` locally to ensure this change passes the linter checks and tests

Expected Behavior
| Scenario | Expected behavior |
|---|---|
| `LANG=de_CH.ISO8859-1 cz info` on Linux/macOS | No `UnicodeEncodeError`; `\u2019` (typographic apostrophe) passes through or is replaced by `?` |
| `LANG=de_CH.ISO8859-1 cz init` completes successfully | Prints `Configuration complete 🚀` without crashing on `\U0001f680` |
| `LANG=en_US.UTF-8 cz info` (UTF-8 terminal) | `_ensure_utf8_stdout` hits the early-return branch and calls no `reconfigure()` |
| Non-`TextIOWrapper` stdout (e.g. an `io.StringIO` stand-in) | `_ensure_utf8_stdout` returns immediately; no `AttributeError` |

Steps to Test This Pull Request
Additional Context
This fix emerged from the cross-cutting audit in #1964, which confirmed the Windows-only guard in `commitizen/out.py:7-9` was the sole cause and that the code path is unchanged in master (v4.15.1). The `_ensure_utf8_stdout` helper was designed to be testable without monkey-patching the real `sys.stdout` and reimporting the module, which is why it accepts an arbitrary `stream` argument rather than hard-coding `sys.stdout` inside the function body.