Skip to content

fix(config): preserve unicode characters when writing yaml config#1966

Open
bearomorphism wants to merge 2 commits intocommitizen-tools:masterfrom
bearomorphism:fix/1164-yaml-allow-unicode
Open

fix(config): preserve unicode characters when writing yaml config#1966
bearomorphism wants to merge 2 commits intocommitizen-tools:masterfrom
bearomorphism:fix/1164-yaml-allow-unicode

Conversation

@bearomorphism
Copy link
Copy Markdown
Collaborator

@bearomorphism bearomorphism commented May 9, 2026

Description

Closes #1164.

Why

PyYAML's yaml.dump defaults to ASCII-only output: any non-ASCII codepoint is escaped using Python's \Uxxxx notation. This means a .cz.yaml containing a literal emoji — for example 🚀 in bump_message — is silently mangled every time cz bump rewrites the config, replacing the readable character with \U0001F680.

Reported by @syepes on commitizen 3.27.0 / Python 3.12 / macOS (#1164), with before-and-after screenshots showing the escape introduced by cz bump --increment PATCH. A triage note from the open-issues audit (2026-05-09) confirmed the bug still reproduces on master (v4.15.1): after cz bump, the bump_message key reads "\U0001F680 chore(release): …" instead of "🚀 chore(release): …".

The root cause is that both yaml.dump call sites in commitizen/config/yaml_config.pyinit_empty_config_content (line 33) and set_key (line 66) — omit allow_unicode=True. PyYAML documents that this flag causes non-ASCII codepoints to be written as literal UTF-8 bytes rather than escape sequences; the output remains valid YAML and valid UTF-8.

What changed

File Change
commitizen/config/yaml_config.py Add allow_unicode=True to yaml.dump in init_empty_config_content (line 33) and in set_key (line 66)
tests/test_conf.py Add test_set_key_preserves_unicode (regression: emoji survives a set_key round-trip) and test_init_empty_config_content_passes_allow_unicode (spy: keyword argument is forwarded to yaml.dump)

How it works

  • set_key (commitizen/config/yaml_config.py:58–68): reads the config with yaml.load, mutates the target key in the parsed dict, then re-serialises with yaml.dump. Without allow_unicode=True, any non-ASCII character already present in the file is escaped on write-back — producing the \U0001F680 symptom. Adding the flag instructs PyYAML to emit those characters as literal UTF-8 bytes.
  • init_empty_config_content (commitizen/config/yaml_config.py:29–33): writes {"commitizen": {}} in append mode during cz init. Its payload is currently ASCII-only, so allow_unicode=True has no observable effect today. It is added for consistency and to guard against future default content that might include non-ASCII characters.
  • Test strategy: test_set_key_preserves_unicode writes a YAML file containing 🚀, calls set_key("version", "0.1.1"), and asserts both that 🚀 survives and that \U0001F680 does not appear — this is the direct regression test for cz bump - Overwrites my cz.yaml emoji #1164. test_init_empty_config_content_passes_allow_unicode uses mocker.spy(yaml, "dump") to assert the keyword is forwarded rather than testing round-trip behaviour that does not currently occur (see the review note in Additional Context).
  • Why not a custom Dumper? PyYAML's allow_unicode flag is the documented, single-argument way to opt out of ASCII escaping. A custom Dumper subclass would be substantially more complex and carries a maintenance burden with no additional benefit for this use case.

Backward compatibility

  • YAML files that already contain \Uxxxx escape sequences are normalised on the next cz bump: yaml.load decodes the escape to the Unicode character on read, and the subsequent yaml.dump re-emits it as a literal codepoint. No data is lost and no manual migration is needed.
  • All 20 existing TestYamlConfig tests in tests/test_conf.py pass without modification.
  • No change to the public API, CLI flags, exit codes, or any config backend other than YAMLConfig.

Checklist

Was generative AI tooling used to co-author this PR?

  • Yes (please specify the tool below)

Generated-by: Claude following the guidelines

Code Changes

  • Add test cases to all the changes you introduce
  • Run uv run poe all locally to ensure this change passes linter check and tests
  • Manually test the changes (see "Steps to Test" below)
  • Update the documentation for the changes

Expected Behavior

Scenario Outcome
.cz.yaml has bump_message: "🚀 chore: bump …" and cz bump is run Emoji 🚀 is preserved verbatim; \U0001F680 does not appear
.cz.yaml already contains an escaped \U0001F680 and cz bump is run PyYAML normalises the escape on read and re-emits the literal 🚀 — file is healed in place
cz init creates a new .cz.yaml Output is ---\ncommitizen: {}\n, identical to the pre-fix behaviour

Steps to Test This Pull Request

git fetch fork fix/1164-yaml-allow-unicode
git checkout fork/fix/1164-yaml-allow-unicode

# 1. Targeted regression test.
uv run pytest tests/test_conf.py::TestYamlConfig::test_set_key_preserves_unicode \
              tests/test_conf.py::TestYamlConfig::test_init_empty_config_content_passes_allow_unicode -v

# 2. Reproduce the bug, then verify the fix.
mkdir /tmp/cz-1164 && cd /tmp/cz-1164
git init -b main
git config user.name test && git config user.email test@example.com
cat > .cz.yaml << 'EOF'
commitizen:
  name: cz_conventional_commits
  version: 0.1.0
  tag_format: "v$version"
  bump_message: "🚀 chore(release): bump $current_version to $new_version"
EOF
git add .cz.yaml && git commit -m "feat: initial"
cz bump --increment PATCH --yes
grep bump_message .cz.yaml
# Expected:  bump_message: "🚀 chore(release): ..."
# NOT:       bump_message: "\U0001F680 chore(release): ..."

Additional Context

This fix was identified during the open-issues audit tracked in #1964. A triage note (@bearomorphism, 2026-05-09) confirmed the bug reproduces on master (v4.15.1) and pinpointed commitizen/config/yaml_config.py:66 as the root cause, with a symmetric fix needed at line 33 for consistency.

Review note — test update: the initial draft included test_init_empty_config_content_uses_allow_unicode, which pre-seeded a YAML file with 🚀 and asserted the emoji survived init_empty_config_content(). This was misleading: because init_empty_config_content opens in append mode and writes only {"commitizen": {}}, the assertion would pass regardless of whether allow_unicode=True was present. The test has been replaced with a mocker.spy-based assertion that directly verifies the keyword argument is forwarded, without pretending to test behaviour that doesn't currently happen (see PR comment).

@codecov
Copy link
Copy Markdown

codecov Bot commented May 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.23%. Comparing base (4b93a50) to head (56bca18).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1966   +/-   ##
=======================================
  Coverage   98.23%   98.23%           
=======================================
  Files          61       61           
  Lines        2779     2779           
=======================================
  Hits         2730     2730           
  Misses         49       49           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

yaml.dump() defaults to ASCII-only output, which causes `cz bump` (and
`cz init`) to rewrite emoji and other non-ASCII characters in `.cz.yaml`
as `\Uxxxx` escape sequences. Pass `allow_unicode=True` so the original
characters round-trip.

Closes commitizen-tools#1164

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bearomorphism bearomorphism force-pushed the fix/1164-yaml-allow-unicode branch from 855b8dd to b404143 Compare May 9, 2026 12:32
@bearomorphism
Copy link
Copy Markdown
Collaborator Author

Updated per @bearomorphism's review:

You're right — the original test_init_empty_config_content_uses_allow_unicode was misleading. It pre-seeded the file with an emoji and asserted the emoji survived init_empty_config_content(), but the method only writes {"commitizen": {}} (ASCII-only) in append mode, so allow_unicode=True had no observable effect on the test outcome.

Replaced it with a more honest test that uses mocker.spy(yaml, "dump") to assert the keyword argument is passed. This protects against future regressions if anyone ever adds a Unicode default to the dict, without pretending to test behaviour that doesn't currently happen.

Both test_set_key_preserves_unicode (which covers the actual #1164 regression) and the new spy-based test pass.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes YAML config rewrites so non-ASCII characters (e.g., emojis in bump_message) are preserved as literal Unicode instead of being escaped to \Uxxxxxxxx sequences when cz bump updates .cz.yaml/cz.yaml.

Changes:

  • Pass allow_unicode=True to yaml.dump in YAMLConfig.init_empty_config_content and YAMLConfig.set_key.
  • Add a regression test ensuring set_key does not introduce \U... escapes for emojis.
  • Add a spy-based test asserting init_empty_config_content forwards allow_unicode=True to yaml.dump.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
commitizen/config/yaml_config.py Enables Unicode emission in YAML serialization by adding allow_unicode=True to yaml.dump call sites.
tests/test_conf.py Adds regression + forwarding tests to prevent reintroducing YAML Unicode escaping.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread commitizen/config/yaml_config.py Outdated
Comment thread commitizen/config/yaml_config.py Outdated
Comment thread tests/test_conf.py Outdated
* rename json_file -> yaml_file context-manager var

* force UTF-8 for YAML writes (YAML 1.2 spec mandates UTF-8/16/32) to avoid UnicodeEncodeError when self._settings.encoding is non-UTF-8

* tighten regression test to reject any \Uxxxxxxxx escape (case-insensitive)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cz bump - Overwrites my cz.yaml emoji

2 participants