I would like to describe what we have learned while testing with utf-8 encoded Croatian characters.
Introduction
One issue on a previous project was a hanging Java virtual machine (JVM) thread. A web service servlet started an application server thread to process each incoming XML message. The XML encoding was UTF-8.
Then, using the Spring and Hibernate frameworks, a JDBC connection to the Informix database was used to store the data. In some cases in the production environment, that thread hung indefinitely. After we gathered customer bug reports, we confirmed that XML with ‘broken’ encoding caused the hang. By broken encoding we mean that some Croatian character (e.g. ‘š’, with hex code c5 a1) was encoded with some other hex code. We reproduced the case by copying and pasting part of an XML message with broken UTF-8 Croatian characters from a Mantis bug report. At that moment we did not know how to produce those broken characters ourselves. On the server side we implemented code that intercepts such broken encoding and returns an appropriate error.
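To illustrate the idea of intercepting broken encoding, here is a minimal sketch in Python. It is not our server code (that part is Java and is not shown in this post); the function name check_utf8 and the sample messages are made up for the example. It also assumes that "broken" means "not a valid UTF-8 byte sequence"; text that was encoded twice can still be valid UTF-8 and would need a different check.

```python
# Hypothetical helper, not the real server-side check.
def check_utf8(payload):
    """Return (True, None) for valid UTF-8 bytes, else (False, message)."""
    try:
        payload.decode('utf-8')  # strict decoding rejects invalid sequences
    except UnicodeDecodeError as exc:
        return False, "broken encoding at byte offset %d" % exc.start
    return True, None


good = b'<name>\xc5\xa1</name>'  # 's with caron' correctly encoded as c5 a1
bad = b'<name>\xc5</name>'       # lead byte c5 without its continuation byte

good_result = check_utf8(good)
bad_result = check_utf8(bad)
```

Returning an error for the bad message, instead of passing it further down to the database layer, is the behavior we wanted on the server.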
Problem
During a regression test that used Python as the automation tool, tester @majapenovic received the broken encoding error. She asked the tester who wrote the testing script about the problem, and his solution was to delete the Croatian characters used for generating the XML input message. This is a VERY BAD TESTER’S DECISION. I told Maja to investigate the root of the problem.
utf-8 and Python
We learned how to use utf-8 in Python from this excellent post.
We use Jython to write integration scripts. From the bottom up, you should configure the following for proper utf-8 string manipulation:
- The JVM that runs Jython should have the following Java option: -Dfile.encoding=UTF-8 (you can find that option in the bin/jython file).
- At the beginning of the Jython script: #coding=utf-8
- Your editor encoding must be set to utf-8.
- Your keyboard must be set to Croatian (if you work with the Croatian utf-8 character set).
croCharsInUnicodeUTF8 = byteStreamReceivedFromHttp.decode('utf-8')
# In verification checks, compare byte stream with byte stream!
For writing a byte stream to a file we use the following code:
import codecs
f = codecs.open(file_path, 'w', 'utf-8')
f.write(croCharsAsByteStream)
f.close()
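A small round trip makes it easier to see what actually lands in the file. This is a sketch under assumptions: the temporary file is made up for the example, and the value passed to write is a unicode object, because a codecs writer encodes on write.

```python
import codecs
import os
import tempfile

text = u'\u0161\u0111\u010d\u0107\u017e'  # the five lowercase Croatian letters

# Hypothetical temporary file, used only for this example.
fd, file_path = tempfile.mkstemp(suffix='.xml')
os.close(fd)
try:
    with codecs.open(file_path, 'w', 'utf-8') as f:
        f.write(text)  # the codecs writer encodes unicode to UTF-8 bytes

    # Read the file back in binary mode to see the raw bytes on disk.
    with open(file_path, 'rb') as raw:
        written_bytes = raw.read()
finally:
    os.remove(file_path)
```

Comparing written_bytes with the expected byte sequence follows the rule from above: compare byte stream with byte stream.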
I helped Maja check whether our XML message was created with broken encoding. Every XML message is stored in the database, and the database encoding was set to utf-8. We unloaded the record with our XML message (Informix unload statement) and used vi in hex view mode (:%!xxd) to observe the hex encoding of the character ‘š’ in the XML message. We confirmed that it had the wrong hex values.
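The same hex check can be done from a script instead of vi. The helper below is a minimal sketch; hex_bytes is a made-up name and not part of our test scripts.

```python
import binascii


def hex_bytes(data):
    """Return the bytes as space-separated hex pairs, similar to xxd."""
    hexed = binascii.hexlify(data).decode('ascii')
    return ' '.join(hexed[i:i + 2] for i in range(0, len(hexed), 2))


correct = u'\u0161'.encode('utf-8')  # 's with caron' as UTF-8 bytes
correct_hex = hex_bytes(correct)     # 'c5 a1'
```

Any other value for that character in the unloaded record means the encoding was broken somewhere before the message reached the database.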
Tester magic
Maja started the investigation and found a bug in our testing script. We called the encode() method on a byte stream twice in a row, in different methods. That caused the broken encoding of the XML message.
Maja did not follow any best practice; she adapted to the context of the problem and used the existing code functionality. Write to file was called in several places, and she decided to observe the output of those methods. At some point, the output had broken encoding for the character ‘š’. After that she easily spotted the line with the bug in our script.
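Here is a sketch of how encoding twice can break a character. The intermediate step is an assumption: it supposes that the already encoded bytes get treated as Latin-1 text before the second encode. The exact codec involved in our script may have been different.

```python
original = u'\u0161'             # 's with caron'
once = original.encode('utf-8')  # correct bytes: c5 a1

# Assumption: the encoded bytes are misread as Latin-1 text...
misread = once.decode('latin-1')
# ...and then encoded a second time.
twice = misread.encode('utf-8')  # broken bytes: c3 85 c2 a1

# Undoing the two steps in reverse order restores the correct bytes.
repaired = twice.decode('utf-8').encode('latin-1')
```

Instead of the two bytes c5 a1, the message now carries four bytes for one character, which matches the kind of wrong hex values we saw in the hex view.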
What we learned
A tester’s job is to find proper solutions, not to do ‘dirty workarounds’. As a result of investigating the problem, we now know how to reproduce the broken encoding error.
Question
Are you able to see the Croatian characters from this blog post in your browser?
Update regarding replace string method
If you need to do something like this:
request = request.replace("placeholder", unicode('Š', 'utf-8'))
you will get the following exception:
UnicodeDecodeError: 'ascii' codec can't decode byte
only if the string in which you are doing the replacement (in this example, the request string) is a byte string with non-ASCII bytes rather than a unicode object. You usually get a unicode object either by calling decode('utf-8') on the byte string, or by reading a file with the following code snippet:
import codecs
f = codecs.open('file_path', 'r', 'utf-8')
request = f.read()
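The safe order of operations can be sketched like this: decode first, replace on unicode objects, encode once at the end. The raw_request value is made up for the example.

```python
# Hypothetical request bytes, e.g. as received over HTTP.
raw_request = b'<name>placeholder</name>'

request = raw_request.decode('utf-8')  # unicode object from here on
request = request.replace(u'placeholder', u'\u0160')  # capital S with caron

# Encode exactly once, right before sending or storing the message.
encoded_request = request.encode('utf-8')
```

Because both sides of replace are unicode objects, Python has no reason to fall back to an implicit conversion with the ascii codec.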
Update 2 regarding UnicodeDecodeError: 'ascii' codec can't decode byte
Today I found an excellent post about the famous exception
UnicodeDecodeError: 'ascii' codec can't decode byte
Now I finally understand that rather strange exception. In Python 2.x, encode works on unicode strings and decode works on byte strings. If you call encode on a byte string, Python first implicitly tries to decode it to a unicode object using the default ‘ascii’ encoding. If your byte string contains bytes outside the ASCII range, you trigger that exception.
The problem is that Python 2.x uses ascii as the default encoding for “historical reasons”.
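The hidden step can be made visible by writing it out explicitly. This sketch performs the ascii decode by hand, which is the step Python 2.x performs implicitly before encoding a byte string.

```python
data = u'\u0161'.encode('utf-8')  # byte string with the bytes c5 a1

# The explicit form of the implicit step: decode the bytes as ascii.
try:
    data.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)
```

The message stored in error names the ascii codec and the byte 0xc5, even though the script never mentioned ascii, which is why the exception looks so strange at first.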
I can see Croatian characters (ŠĐČĆŽšđčćž), but that should not be a surprise. 🙂