Python: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

October 11, 2024

I am fetching data from a catalog and it's giving data in bytes format. Bytes data: b'\x80\x00\x00\x00\n\x00\x00%\x83\xa0\x08\x01\x00\xbb@\x00\x00\x05p \x02\x00>\xf3\x00\x00\x0

Solution 1:

The UTF-8 encoding has some built-in redundancy that serves at least two purposes:

1) locating code points when reading forwards or backwards

Start bytes match one of these four patterns (shown in binary; the dots are the bits that carry the actual data)

0.......
110.....
1110....
11110...

whereas continuation bytes (0 to 3 of them follow a start byte) always have this form

10......

2) checking for validity

If the data does not follow this scheme, it is safe to say that it is not valid UTF-8, e.g. because it was corrupted during a transfer.
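The bit patterns above can be sketched as a small classifier. This is only an illustration: `classify_utf8_byte` is a made-up helper name, and it looks at the leading bits only, whereas a real UTF-8 decoder rejects more (overlong forms, for example).

```python
def classify_utf8_byte(byte: int) -> str:
    """Classify one byte value (0-255) by its leading bits only."""
    if byte & 0b1000_0000 == 0b0000_0000:
        return "start (1-byte sequence)"   # 0.......
    if byte & 0b1100_0000 == 0b1000_0000:
        return "continuation"              # 10......
    if byte & 0b1110_0000 == 0b1100_0000:
        return "start (2-byte sequence)"   # 110.....
    if byte & 0b1111_0000 == 0b1110_0000:
        return "start (3-byte sequence)"   # 1110....
    if byte & 0b1111_1000 == 0b1111_0000:
        return "start (4-byte sequence)"   # 11110...
    return "invalid"                       # 11111... never occurs in UTF-8


# Iterating over a bytes object yields integers
for b in b"\x80\x00%\x83":
    print(f"0x{b:02x}  {b:08b}  {classify_utf8_byte(b)}")
```

Both 0x80 and 0x83 come out as continuation bytes, which is why neither can appear where a start byte is expected.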

Conclusion

Why is it possible to say that data starting with b'\x80\x00' cannot be UTF-8? The encoding is violated already at the first byte: 0x80 (binary 10000000) has the form of a continuation byte, so it cannot start a sequence. This is exactly what your error message says:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

And even if you skip this one, you get another problem some bytes later at b'%\x83', where the continuation byte 0x83 follows a plain ASCII character. So most likely you are either trying to decode the wrong data or assuming the wrong encoding.
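Both failures can be reproduced with a short prefix of the bytes from the question. This is a minimal sketch; the variable names are arbitrary.

```python
# First nine bytes of the data shown in the question
data = b"\x80\x00\x00\x00\n\x00\x00%\x83"

first_error = None
second_error = None

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    first_error = exc  # keep a reference; `exc` is cleared after the block
    print(exc.start, hex(exc.object[exc.start]), exc.reason)

# Skipping the first byte only moves the failure to the next stray
# continuation byte (the 0x83 that follows '%')
try:
    data[1:].decode("utf-8")
except UnicodeDecodeError as exc:
    second_error = exc
    print(exc.start, hex(exc.object[exc.start]), exc.reason)
```

The `start` attribute of the exception is the offset of the offending byte, which is the "position" reported in the error message.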

Solution 2:

The data in your example is clearly not text in any common encoding. Neither Python nor we can figure out a way to turn data which is obviously not text into strings.

If this is a well-defined binary file format, find a parser for this format (ideally a popular Python library, but for more obscure or proprietary formats you may not be able to find one) or write one yourself if you can figure out how the data is structured, either by clever experimentation and good guesswork, or by finding (if not authoritative then perhaps more or less speculative third-party) documentation.
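If you do end up writing a parser yourself, the standard `struct` module is the usual starting point. The layout below (two big-endian unsigned 32-bit integers) is purely hypothetical and chosen only to show the mechanics; the real structure of your data is unknown and has to come from documentation or experimentation.

```python
import struct

# First eight bytes of the data shown in the question
data = b"\x80\x00\x00\x00\n\x00\x00%"

# HYPOTHETICAL layout, for illustration only:
# ">II" = big-endian, two unsigned 32-bit integers
first, second = struct.unpack(">II", data[:8])

print(hex(first), hex(second))
```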

If you simply want to turn the bytes into a string whose code points have the same values as the input bytes (so that for example the input byte \xff maps to the Unicode code point U+00FF), the 'latin-1' encoding does this, obscurely but conveniently. The result in this case will obviously not be useful human-readable text; in many ways, it would be more natural, and quite possibly less error-prone and more convenient, to just keep the data as bytes instead.
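A quick way to convince yourself of the latin-1 behaviour is to round-trip every possible byte value. This is a minimal sketch and nothing here is specific to the data in the question.

```python
# Every possible byte value, 0x00 through 0xff
data = bytes(range(256))

# latin-1 maps byte N to code point U+00NN, so decoding never fails
text = data.decode("latin-1")

# Each character's code point equals the original byte value
same_values = [ord(ch) for ch in text] == list(data)

# Encoding back with latin-1 restores the original bytes exactly
round_trip = text.encode("latin-1")

print(same_values, round_trip == data)
```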

Solution 3:

For this encoding error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

or a similar one, open the database dump file (the one with the .json extension), change its encoding to UTF-8 (for example in VS Code, you can change it in the status bar at the bottom right) and save the file.
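The same re-encoding can be done from Python instead of an editor. The sketch below assumes the dump was written as UTF-16, which would be consistent with a leading 0xff byte (part of a UTF-16 byte order mark) but is not confirmed by anything above; `reencode_to_utf8` is a made-up helper name. Keep a backup copy of the file before trying it.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="utf-16"):
    """Rewrite a text file in place as UTF-8.

    ASSUMPTION: the file is currently encoded as `source_encoding`.
    A wrong guess raises an error or silently produces garbage.
    """
    file_path = Path(path)
    # Work on raw bytes so that line endings are left untouched
    text = file_path.read_bytes().decode(source_encoding)
    file_path.write_bytes(text.encode("utf-8"))


# Example call, using the path from this answer:
# reencode_to_utf8("store/dumps/store.json")
```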

Now run

 $ git status

You will see a result like this

 On branch master
 Changes not staged for commit:
   (use "git add <file>..." to update what will be committed)
   (use "git restore <file>..." to discard changes in working directory)
         modified:   store/dumps/store.json

 Untracked files:
   (use "git add <file>..." to include in what will be committed)
         .gitignore

 no changes added to commit (use "git add" and/or "git commit -a")

or like this

On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   store/dumps/store.json
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        .gitignore

In the first case, you have to stage the change with

$ git add store/dumps/

In the second case the change is already staged, so you can skip that step.

Now, in both cases, commit the changes with

$ git commit -m "launching to production"

The console will print a message summarizing the additions and changes.

Then build and deploy the app again (for Heroku users) with

$ git push heroku master

After the build, load the database again with

$ heroku run python manage.py loaddata store/dumps/store.json

This installs the objects from the dump.

Apologies for my English!
