Character encoding is the system that maps letters, numbers, and symbols to the bytes stored in a file. UTF-8, the dominant encoding on the web, covers over 98% of all web pages. When you see garbled text like "Ã©" instead of "é," you're looking at an encoding mismatch between how the file was saved and how it's being read.

Convert your CSV or text files in the browser with encoding preserved.


What Character Encoding Is

Character encoding is a mapping system that assigns a unique number to every letter, symbol, and character in a text file. Computers store text as sequences of bytes, not visible characters. The encoding tells software which byte values correspond to which characters.

Think of it like a codebook. The letter "A" is stored as the number 65, and "é" is stored as a two-byte sequence: 195 followed by 169. If the software reading the file uses a different codebook than the one used to write it, the wrong characters appear on screen.
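The codebook is easy to inspect directly. This minimal Python sketch just prints the byte values behind the two characters mentioned above:

```python
# Inspect the raw UTF-8 bytes behind "A" and "é"
text = "Aé"
data = text.encode("utf-8")

print(list(data))  # [65, 195, 169] -- one byte for "A", two for "é"
```

The same string round-trips back with `data.decode("utf-8")`; the mismatch problems in this article come from decoding with a different codebook than the one used to encode.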

Three encodings matter most today. ASCII handles 128 characters (basic English letters, digits, and punctuation), dates back to 1963, and uses seven bits (one byte in practice) per character.

UTF-8 extends ASCII to cover every character in the Unicode standard, which includes over 149,000 characters across 161 scripts (last verified April 2026). It uses one to four bytes per character. UTF-16 uses two or four bytes and shows up mainly inside Windows internals and Java systems.

How UTF-8 Encoding Actually Works

UTF-8 is a variable-length encoding. ASCII characters (the basic English set) take exactly one byte, which means any pure ASCII file is also a valid UTF-8 file. That backward compatibility is the main reason UTF-8 won the encoding wars.

Characters outside the ASCII range take more bytes. A character like "é" (Unicode code point U+00E9) becomes two bytes in UTF-8. A Chinese character like "中" (U+4E2D) takes three bytes.

An emoji like "😀" (U+1F600) takes four bytes. The practical consequence: a 10KB English text file stays 10KB in UTF-8, but the same file with Chinese text could be up to 30KB because each character needs three bytes instead of one.
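Those per-character byte counts are easy to verify with any language's UTF-8 encoder; a quick Python check:

```python
# UTF-8 is variable-length: 1 to 4 bytes depending on the code point
for ch in ["A", "é", "中", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s)")
```

This prints 1, 2, 3, and 4 bytes respectively, matching the ranges described above.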

Despite that tradeoff, UTF-8 won decisively. According to the W3Techs character encoding survey (last verified April 2026), UTF-8 is used by 98.2% of all websites. No other encoding comes close to that adoption rate.

The byte order mark (BOM) is a common source of confusion. Some applications, particularly Microsoft Excel, prepend a three-byte sequence (EF BB BF) to the beginning of a UTF-8 file. This invisible BOM can break parsers that don't expect it.

When you convert a CSV file and the first column header has a strange invisible character at the start, the BOM is usually the cause. Many JSON parsers and command-line tools choke on it silently.
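If you process files in Python, the standard library's `utf-8-sig` codec strips the BOM when it is present and behaves like plain `utf-8` when it is not. A small sketch of the difference:

```python
# A UTF-8 CSV as Excel might save it: BOM (EF BB BF) before the header row
raw = b"\xef\xbb\xbfname,city\nAmelie,Paris\n"

naive = raw.decode("utf-8")      # keeps the BOM as invisible U+FEFF
clean = raw.decode("utf-8-sig")  # strips the BOM if present

print(naive.startswith("\ufeff"))  # True: first header is polluted
print(clean.startswith("name"))    # True: header is clean
```

Reading the file with `encoding="utf-8-sig"` in `open()` has the same effect.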

Why Character Encoding Breaks Your Files

Encoding problems show up as garbled text, and the technical term for this is mojibake. It happens when a file saved in one encoding is opened with a different one.

The most common scenario: a CSV file exported from a European system uses Latin-1 (ISO 8859-1) encoding, where "é" is stored as the single byte 0xE9. You open it in a tool that assumes UTF-8. Since that byte isn't a valid UTF-8 sequence on its own, the reader substitutes a replacement character ("�") or fails outright. The reverse mismatch, a UTF-8 file read as Latin-1, is what turns "é" into the familiar "Ã©".
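Both mismatch directions are easy to reproduce in Python, whose decoder is strict by default:

```python
# "é" in Latin-1 is the single byte 0xE9 -- invalid on its own in UTF-8
latin1_bytes = "café".encode("latin-1")
print(latin1_bytes)  # b'caf\xe9'

try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    print(latin1_bytes.decode("utf-8", errors="replace"))  # caf�

# The reverse mismatch produces the classic "Ã©" mojibake
print("café".encode("utf-8").decode("latin-1"))  # cafÃ©
```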

When you open a CSV file on ConvertSafe's CSV to JSON converter, the text comes through correctly because the browser's text decoder handles UTF-8 by default. A test with a 42KB CSV containing French accented characters (exported from LibreOffice in UTF-8) converted to JSON with all 1,847 accented characters preserved correctly.

The opposite problem also exists. If you save a file as ASCII and it contains characters outside the 128-character ASCII range (anything with an accent, a non-Latin script, or an emoji), those characters get dropped or replaced. ASCII can only represent 128 characters, and that's not enough for any language other than basic English.
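You can see that loss directly by forcing non-ASCII text through an ASCII encoder; a short Python sketch:

```python
text = "naïve"

# Strict ASCII encoding refuses non-ASCII characters outright
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc.reason)  # ordinal not in range(128)

# Lenient modes either drop or replace the offending character
print(text.encode("ascii", errors="ignore"))   # b'nave'
print(text.encode("ascii", errors="replace"))  # b'na?ve'
```

Either way, the "ï" is gone for good; there is no way to recover it from the ASCII output.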

UTF-8 vs ASCII vs UTF-16: Which Encoding to Use

Feature                          ASCII              UTF-8      UTF-16
Characters supported             128                149,000+   149,000+
Bytes per English character      1                  1          2
Bytes per accented character     Not supported      2          2
Bytes per CJK character          Not supported      3          2
Bytes per emoji                  Not supported      4          4
Backward compatible with ASCII   Yes (it's ASCII)   Yes        No
Web usage share (2026)           < 0.1%             98.2%      < 0.5%

For almost every use case in 2026, UTF-8 is the right choice. It covers every character you'll encounter, it's backward compatible with ASCII, and it's the default on the web, in Linux, in macOS, and in most modern applications.

UTF-16 still matters in two places: Windows internals and Java's internal string representation. If you're writing software that interfaces with the Windows API or processing Java strings, you'll encounter UTF-16. For files you save, share, or publish on the web, stick with UTF-8.

ASCII is only relevant for legacy systems that genuinely cannot handle multi-byte sequences. If your data contains only English characters, UTF-8 and ASCII produce identical output since UTF-8's first 128 code points match ASCII exactly.
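That identical-output claim, and the size differences in the comparison above, can be checked in a few lines of Python:

```python
english = "Hello, world"

# Pure ASCII text: ASCII and UTF-8 produce byte-identical output
assert english.encode("ascii") == english.encode("utf-8")

# UTF-16 uses two bytes even for basic Latin characters
print(len(english.encode("utf-8")))      # 12
print(len(english.encode("utf-16-le")))  # 24 (explicit endianness, no BOM)
```

Note that Python's plain `"utf-16"` codec also prepends a 2-byte BOM; `"utf-16-le"` is used here so the count reflects the characters alone.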

What Most Guides Get Wrong About Encoding

Most encoding guides explain the theory and stop there. The practical detail they miss: your text editor's "Save As" encoding setting and your file's actual encoding are not always the same thing.

Opening a Latin-1 file in VS Code and seeing correct characters doesn't mean the file is UTF-8. VS Code auto-detected the encoding for display purposes, but the bytes on disk haven't changed. Unless you explicitly choose "Save with Encoding" and select UTF-8, you're still working with a Latin-1 file.

To check a file's actual encoding on macOS or Linux, run this in a terminal (the flag is uppercase -I on macOS/BSD; GNU/Linux uses lowercase -i):

file -I yourfile.csv

On Windows, Notepad++ shows the encoding in the bottom status bar. These tools read the actual byte patterns rather than guessing from the displayed text.
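A rough byte-level check can also be scripted. This heuristic sketch (the function name and fallback label are my own, not a standard tool) mirrors what those utilities do: look for a BOM, then test whether the bytes decode as strict UTF-8:

```python
def sniff_encoding(path):
    """Heuristic guess at a text file's encoding from its raw bytes."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"  # UTF-8 with a BOM, as Excel often writes
    try:
        raw.decode("utf-8")
        return "utf-8"  # valid UTF-8 (pure ASCII also lands here)
    except UnicodeDecodeError:
        return "unknown single-byte encoding (possibly latin-1)"
```

Like any sniffer, this can't distinguish between single-byte encodings (Latin-1 vs. Windows-1252, for example); it only tells you the file is definitely not valid UTF-8.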

Frequently Asked Questions

What is character encoding?

Character encoding is the system that maps text characters to numeric byte values stored in a file. Every text file uses an encoding. UTF-8 is the most common, covering 98.2% of websites. When a file is read with the wrong encoding, characters display incorrectly as garbled symbols or question marks.

Why does my file show weird characters?

Garbled text (called mojibake) happens when a file is read with a different encoding than the one used to save it. The most common case is a UTF-8 file opened as Latin-1 or Windows-1252, which turns accented characters like "é" into "Ã©." Fix it by re-opening the file with the correct encoding selected in your text editor.

What is the difference between UTF-8 and ASCII?

ASCII maps 128 characters (English letters, digits, basic punctuation) using one byte each. UTF-8 extends ASCII to support over 149,000 characters from every writing system, using one to four bytes per character. Every ASCII file is automatically valid UTF-8, but not the other way around.

How do I fix encoding issues in a CSV file?

Open the CSV in a text editor that shows encoding (VS Code, Notepad++, or Sublime Text). Check the current encoding in the status bar. If it's not UTF-8, use "Save with Encoding" or "Reopen with Encoding" to convert it. Then reopen the file in your target application.
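If you'd rather script the conversion than click through an editor, a small Python helper works; the function name and the Latin-1 source encoding below are assumptions to adapt to your own file:

```python
def reencode(src_path, dst_path, src_encoding="latin-1"):
    """Rewrite a text file as UTF-8 (source encoding is an assumption)."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    # newline="" preserves the file's line endings exactly as read
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

For example, `reencode("input.csv", "input_utf8.csv")` writes a UTF-8 copy alongside the original, so nothing is lost if the source encoding guess was wrong.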

Does converting between file formats change the encoding?

It depends on the conversion. Text-based conversions (CSV to JSON, Markdown to HTML) generally preserve UTF-8 encoding when both input and output are text formats. Image and binary conversions don't involve character encoding at all. Browser-based conversions use the browser's built-in UTF-8 text decoder by default.