Valid UTF-8 data (hex:) followed by invalid UTF-8 sequence

April 9th, 2007

OK, this one is a bit geeked out again, but it’s relevant to China. If you’re an American, you could probably go your entire life without ever bumping into code pages, but if your life crosses paths with Asia, you almost certainly will…

As we’re developing a new website and doing our Subversion (version control system) check-ins, I started bumping into a very unusual error.

ryan@116843:/spike/public/news/app/webroot/redv1.0/img/menu$ sudo svn up
svn: Valid UTF-8 data (hex:) followed by invalid UTF-8 sequence (hex: b8 b4 bc fe)

Unfortunately, Google didn’t come up with much. The best hit was an October 10th post on the Subversion users mailing list. Basically, the answer is that there’s no answer.

Well, I did an svn up in each child directory of the one causing the problem and eventually tracked the error down through my project’s directory tree. It looks like one of the guys using a Windows system copied a JPEG with a Chinese GBK-encoded filename onto the server. Everything is best kept in UTF-8.

Once you’ve found the right file, you have to figure out how to delete a file with a name that can’t be typed…
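
To see why those four bytes point at GBK, here is a minimal Python sketch that tries a few candidate encodings on the sequence from the error message. The helper name `try_decodings` and the list of encodings are my own assumptions for illustration; nothing here is part of svn itself.

```python
# The four bytes svn reported as the invalid UTF-8 sequence.
raw = bytes.fromhex("b8 b4 bc fe")


def try_decodings(data, encodings=("utf-8", "gbk", "big5", "shift_jis")):
    """Return {encoding: decoded text, or None if the bytes are invalid}."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Hypothetical usage: print what each candidate encoding makes of the bytes.
for enc, text in try_decodings(raw).items():
    print(f"{enc:10} -> {text!r}")
```

The bytes are rejected as UTF-8, while GBK turns them into two characters, 复件, which as far as I can tell is the “Copy of” prefix that Chinese-language Windows puts in front of a duplicated file. An encoding that happens to decode without error is not proof, so treat the other results as guesses.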

ryan@116843:/spike/public/news/app/webroot/redv1.0/img/menu$ ls
logo02.jpg       ???? logo.jpg  menu_acc_down.jpg      menu_home_down.jpg  menu_work_down.jpg
logo03.jpg       logo.jpg       menu_acc.jpg           menu_home.jpg       menu_work.jpg
logo04.jpg       logo_top1.jpg  menu_cameras_down.jpg  menu_len_down.jpg
logo05.jpg       logo_top2.jpg  menu_cameras.jpg       menu_len.jpg
logo06.jpg       logo_top3.jpg  menu_gall_down.jpg     menu_tech_down.jpg
logo_bottom.jpg  logo_top.jpg   menu_gall.jpg          menu_tech.jpg

In this case, I just used: rm *\ logo.jpg since there was only one file matching this pattern… Next, I could commit again!
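
Hunting directory by directory works, but the search can also be scripted. The following is a rough Python sketch, assuming a Linux-style filesystem where filenames are raw bytes; `find_non_utf8_names` is a name I made up for this example.

```python
import os


def find_non_utf8_names(root):
    """Yield byte paths under root whose file or directory name is not valid UTF-8."""
    # Walking with a bytes path makes os.walk hand back raw byte names,
    # so nothing gets decoded (or mangled) before we can inspect it.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            try:
                name.decode("utf-8")
            except UnicodeDecodeError:
                yield os.path.join(dirpath, name)
```

Looping over `find_non_utf8_names(".")` and printing `repr(path)` shows the offending bytes as escapes. Because each result is a bytes path, it can be passed straight to `os.remove` or `os.rename`, which sidesteps the problem of typing the name. Note that this also walks svn’s own administrative directories, so review the results before deleting anything.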

ryan@116843:/spike$ sudo svn up
D public/.htaccess
Updated to revision 38.

ryan Uncategorized (无大类)

  1. May 4th, 2007 at 01:24 | #1

    Thanks, that was just my problem!

    (no other results suggested filename issues :( )

    I wrote a tiny script to enumerate through the directory, outputting the path and then running ‘svn status’ on each one to find the culprit (as I found no way to get svn to output which folder it was about to try before doing it, so that it would show before the ‘helpful’ error message).

  2. May 4th, 2007 at 14:59 | #2

    Thanks! I also ran into this problem, and could not see anyone coming up with a solution. I actually thought the problem lay within the files – thus deleting the ones making trouble would fix the problem. Good thing I spotted your blog first :-)

  3. May 9th, 2007 at 14:27 | #3

    Thanks for this post. You saved me quite a bit of time.

  4. June 29th, 2008 at 10:50 | #4

    Thanks, you saved me a bit of time. You can run into problems importing invalid UTF-8 too. I was importing a WordPress sitemap plugin that gave me problems. The last filename before the error was the folder of files that were invalid.

    Hope this helps someone.

  5. October 1st, 2008 at 06:45 | #5

    I’ve just had this problem. I forced UTF-8 encoding on all the files I had last changed, and it works!

  6. November 3rd, 2008 at 11:11 | #6

    Too bad you’ve got a lot of spam on here, but the answer here was perfect. Thanks.

  7. ryan
    November 3rd, 2008 at 11:33 | #7
    “Too bad you’ve got a lot of spam on here, but the answer here was perfect. Thanks.”

    I’ve just gone through and cleaned up some of the remaining spam, but I must admit that I have no clue where some of it comes from… To submit a comment you should be required to fill out the reCAPTCHA field, but it seems that some spammers have found a way around this. Since installing reCAPTCHA the spam has slowed down dramatically, but it does still arrive and the rate does seem to be increasing again.

    Maybe there’s a backdoor in reCAPTCHA?

  8. evdsande
    November 11th, 2008 at 05:21 | #8

    I ran into the same error; however, I cannot find the file causing it. I now even have the same error on all my repositories, including the ones that were not affected. I still don’t have a clue how to recover my repositories. I already removed the malicious project from the affected repository using svnfilter, with no effect. Is there a way to rebuild the repository from the dump and identify the malicious directory or file?

    Regards Eric

  9. ryan
    November 11th, 2008 at 05:26 | #9

    @Eric-

    Your best bet is to go through each of the subdirectories of your project and do an “svn up” on the directory one at a time until you find the subdirectory(s) containing the file(s) that have an incompatible encoding.

    Another useful tool is the Unix “file” command. Just run “file *” in the directory that you locate with the incompatible encoding and search for the file that comes back to you with an encoding type other than UTF-8.

    Best of luck-

    -R

  10. evdsande
    November 11th, 2008 at 06:28 | #10

    Hi Ryan,

    I did the exercise as you suggested, but I don’t see any UTF-8 encoded types. The file * command gives me “ASCII C++ program text, with CRLF line terminators”, “XML document text”, etc., but no UTF-8. Besides, when I ran into this trouble I removed the latest directory that created this error from my repository; my current repository contains only those parts that were there before the problem popped up, yet the problem still exists. The problem even pops up on repositories that were never touched. I switched to another svn client, tried the command line and even created a new repository to test; all of them give me the same error, and now I’m totally puzzled and stuck with a messed-up Subversion installation.

    This is the part of the log from TortoiseSVN where the error popped up for the first time.

    Adding : C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin\inc
    Adding : C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin\inc\IBugTraqProvider.idl
    Adding : C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin\inc\IBugTraqProvider_i.c
    Adding : C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin\inc\IBugTraqProvider_h.h
    Adding : C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin\inc\Interop.BugTraqProvider.dll
    Adding : C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin\issue-tracker-plugins.txt
    Error : Valid UTF-8 data
    Error : (hex:)
    Error : followed by invalid UTF-8 sequence
    Error : (hex: c0 a4 01)
    Finished! : 52 kBytes transferred in 0 minute(s) and 2 second(s)

    I removed the directories and files added here using svnadmin dump and svnfilter, however in the new repository the error still persists (I removed most of the revisions that handled the above actions, but not all of them could be removed). What I cannot understand is, how this can affect a different repository?

  11. ryan
    November 11th, 2008 at 07:05 | #11

    @Eric-

    I understand your frustration buddy. Just trying to help. You’re not looking for files that ARE UTF-8. You’re looking for files that ARE NOT UTF-8.

    In your case, just run:

    cd C:\Projects\Visual Studio 2005\Projects\TortoiseRedminePlugin
    svn up inc
    svn up …

    (where “…” is another folder under TortoiseRedminePlugin)

    You’ll notice that this will run successfully for most of your files, but there will be some that don’t complete successfully. When you locate the folder where the error starts, then you run “svn up” one file at a time until you find the individual file(s) that are causing the problem. Delete those files from the repository and you’ll be all fixed up.

    As for the UTF-8 issue… Again – the key is to find files that are in encodings OTHER THAN UTF-8. You’ve got such a file – it’s just a matter of finding it.

  12. evdsande
    November 11th, 2008 at 07:41 | #12

    Hi Ryan,

    Sorry if I sounded offensive; it’s Subversion that’s frustrating me, giving me no decent clue where to look. I appreciate your help very much. Thanks a lot! I will go through my files again, but how about the project I removed from the repository? Should I add it to the repository again to go through all the files in there?

  13. evdsande
    November 11th, 2008 at 12:30 | #13

    Finally I rebuilt Subversion with a patch that showed me the directory that was causing the trouble (found the patch at: http://www.nabble.com/-PATCH–Issue–2748:-non-UTF-8-filenames-in-the-repository-td19531299.html); however, it hasn’t helped me yet. The error I get now is:

    Adding D:\Projects\t\TextDocument.txt
    Error Error converting entry in directory
    Error ‘/dtm/home/svn/svn/t/db/transactions/0-0.txn’ to UTF8
    Error Valid UTF-8 data
    Error (hex:)
    Error followed by an invalid UTF-8 sequence
    Error (hex: c0 a4 01)

    Now there are at least 2 things I don’t get:

    1: The directory it points to is a Subversion repository transaction log file, not the project file I’m trying to import.
    2: This is a new repository; I deleted all my old repositories and did a fresh import of a single text file edited with vi.

    And I still get the same error? I’m totally puzzled. If you have any clue,

    Please….

    Thankz Eric

  14. Agris
    February 11th, 2009 at 01:20 | #14

    Yes, filenames with non-latin chars cause the problem. Deleting them fixes the problem.

  15. cooper
    March 7th, 2009 at 14:42 | #15

    strace svn status will give you the name of the offending file. Unfortunately, svn cares about the names of files that are in one of its directories, even if they’re not under revision control.

  16. ryan
    June 23rd, 2009 at 21:23 | #16

    Evdsande-

    For some reason I never got notified about this comment.

    The issue as reported above applies certainly to Mac OS X, Linux and other Unix systems, but I haven’t used Windows for more than a few minutes in the entire 21st Century, so I’m of limited help here.

    The error does look the same though, so it is most likely one of the files under your “TortoiseRedminePlugin” folder. I would go through the sub-folders one at a time (first to “inc”) and then to the other sub-folders and do the commit one by one.

    One of the files should have the encoding issue as described. Delete that file and the commit should proceed properly.

    Best-

    -R

  17. August 2nd, 2009 at 11:57 | #17

    Great post, I found it via Google. My problem was almost identical. An image got copied into a checked out tree, and svn update began failing due to a wacky character in the image name.

    Thanks for taking the time to post this! It’s always a great feeling to find a solution to strange issues that actually work.

  18. jo
    October 8th, 2009 at 12:43 | #18

    also happens for files that are not part of SVN!

    my program spits out a log file each run — an error in the filename string made a bunch of log files with garbled file names. i got the same error as others (valid UTF-8 followed by invalid UTF-8), even though none of these log files were ever checked into the repo!

  19. October 28th, 2009 at 01:50 | #19

    Just joining in with the thanks!

    Top result on Google for svn “followed by invalid UTF-8 sequence”, you should be proud :)

  20. December 1st, 2009 at 04:20 | #20

    Thanks for this post, this saved me some time. I just want to add my two cents. Sometimes finding the exact file that’s causing the problem is tough. I have an images directory with 2k files and one of them has this problem.

    svn: Error converting entry in directory 'images/thumbnails' to UTF-8
    svn: Valid UTF-8 data
    (hex: 4e 6f 6b 69 61 2d 35 35 33 30 2d)
    followed by invalid UTF-8 sequence
    (hex: 96 2d 58 70)
    

    So this told me the directory was images/thumbnails. To find which file, I did:

    $ printf "\x4e\x6f\x6b\x69\x61\x2d\x35\x35\x33\x30\x2d\n"
    Nokia-5530-
    

    So this told me the filename starts with Nokia-5530- :)

    Hope this helps

    Carlos
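
The hex-decoding trick from comment #20 can be wrapped in a small helper so the bytes don’t have to be retyped by hand. This is only a sketch in Python; `decode_svn_hex` is a made-up name, and the regular expression assumes the error text looks like the samples quoted in this thread.

```python
import re


def decode_svn_hex(message):
    """Return the raw bytes of every '(hex: ...)' group in an svn error message."""
    groups = re.findall(r"\(hex:([0-9a-fA-F ]*)\)", message)
    return [bytes.fromhex(group.strip()) for group in groups]


# Error text copied from comment #20.
error = """svn: Valid UTF-8 data
(hex: 4e 6f 6b 69 61 2d 35 35 33 30 2d)
followed by invalid UTF-8 sequence
(hex: 96 2d 58 70)"""

valid, invalid = decode_svn_hex(error)
print(valid.decode("utf-8"))  # the readable start of the filename
print(invalid)                # the bytes svn could not convert
```

The first group gives the readable prefix of the filename, and the second shows where it goes wrong. Here the bad byte is 0x96, which is an en dash in Windows-1252, so one plausible guess (and only a guess) is that the file was named on a Western-locale Windows machine.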

  1. February 21st, 2009 at 16:53 | #1