NuSphere Forums
Detecting UTF-8 correctly *solved*


I just started to look at PHPEd 4.5, and I'm very impressed. Just one gripe - and it may well be something I missed.

I work with a mix of UTF-8 and ISO-8859-1 encoded files. That means I should keep the Default file encoding on System default encoding, right? Well, with this setup my UTF-8 encoded files are not detected as such when I first open them. However, if I do a Save as with UTF-8 encoding, they are correctly detected when I reopen them later. This seems pretty strange to me, since the file content has not changed: PHPEd does not add a BOM as far as I can see. So what's going on here? It is a bit of a pain in the a** to resave all the files that should be in UTF-8 encoding.

Can I get PHPEd to detect them correctly somehow?
Posted by svenax
Site Admin

I would recommend splitting mixed files into separate files, each with its own encoding.
Quote:
UTF-8 encoded files are not detected as such

PhpED does not attempt to "detect" anything. It just follows your instructions. If you set Default system encoding, it will be used.
Quote:
PHPEd does not add a BOM

BOM stands for Byte Order Mark, and it has no relation to single-byte encodings like ISO-8859-1 or UTF-8.
If you need BOM'ed encodings, use the UTF-16 family, which has LE (little endian) and BE (big endian) variants.
Posted by dmitri


ddmitrie wrote:
I would recommend splitting mixed files into separate files, each with its own encoding.

It seems we have a serious misunderstanding here. Of course I am not trying to use different encodings in the same file. That is obviously impossible.

ddmitrie wrote:
Quote:
UTF-8 encoded files are not detected as such

PhpED does not attempt to "detect" anything. It just follows your instructions. If you set Default system encoding, it will be used.

Well, if that is the case, how can PHPEd know that a file is UTF-8 after it has been saved as such (from within PHPEd)? It is still just a bunch of bytes.

ddmitrie wrote:
Quote:
PHPEd does not add a BOM

BOM stands for Byte Order Mark, and it has no relation to single-byte encodings like ISO-8859-1 or UTF-8.
If you need BOM'ed encodings, use the UTF-16 family, which has LE (little endian) and BE (big endian) variants.

No, I know that. I just mentioned that PHPEd doesn't add a BOM (which certainly can be used for UTF-8 files) to show that the file content hasn't changed at all, but still my UTF-8 files were not detected as such until after I saved them from within PHPEd.
Posted by svenax
Site Admin

Quote:
Well, if that is the case, how can PHPEd know that a file is UTF-8 after it has been saved as such (from within PHPEd)? It is still just a bunch of bytes

PhpED remembered that a different encoding (UTF-8) was selected when you saved the file.
The next time you open the file, it applies this encoding instead of the "system default".

Quote:
No, I know that. I just mentioned that PHPEd doesn't add a BOM (which certainly can be used for UTF-8 files)

A BOM mostly makes sense for encodings that use 2 or 4 bytes per symbol, and since UTF-8 is a single-byte encoding, I don't know of any use for a BOM there.
For example, many XML files are UTF-8 encoded. Have you ever seen any BOMs in them?
Posted by dmitri


ddmitrie wrote:
Quote:
Well, if that is the case, how can PHPEd know that a file is UTF-8 after it has been saved as such (from within PHPEd)? It is still just a bunch of bytes

PhpED remembered that a different encoding (UTF-8) was selected when you saved the file.
The next time you open the file, it applies this encoding instead of the "system default".

OK, so the answer to my question is no: PHPEdit doesn't know which files are UTF-8 encoded until they have been saved as such by the program itself. That is pretty inconvenient. Other editors can usually detect the encoding using some suitable heuristics (such as checking for a BOM).
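
To illustrate what I mean by checking for a BOM, here is a minimal sketch in Python. It is not how any particular editor does it; the function name `detect_bom` is made up for this example, and it only uses the BOM constants from the standard `codecs` module.

```python
import codecs

# Known BOM signatures. The UTF-32 marks come first because the UTF-32 LE
# mark begins with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A file without a BOM returns `None` here, so this check alone says nothing about BOM-less UTF-8 files like mine.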

ddmitrie wrote:
Quote:
No, I know that. I just mentioned that PHPEd doesn't add a BOM (which certainly can be used for UTF-8 files)

A BOM mostly makes sense for encodings that use 2 or 4 bytes per symbol, and since UTF-8 is a single-byte encoding, I don't know of any use for a BOM there.
For example, many XML files are UTF-8 encoded. Have you ever seen any BOMs in them?

Information about Byte Order Marks: http://www.unicode.org/faq/utf_bom.html#25
Posted by svenax
Site Admin

First, it's PhpED, not PHPEdit :)
And it does not use a BOM, true.
A BOM is really rarely used, so I personally do not think it's a big deal at all.
If you have a UTF-8 file, just open it as UTF-8 (select UTF-8 in the appropriate combo box in the File Open dialog) and it will work fine.
Regarding "suitable heuristics", do they work reliably and return correct results in all cases? :)
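
For reference, one common heuristic of this kind can be sketched as follows (Python; the function name `guess_encoding` and the ISO-8859-1 fallback are assumptions for this example, not anything PhpED does). It tries a strict UTF-8 decode and falls back when that fails.

```python
def guess_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Guess 'ascii', 'utf-8', or the fallback encoding for a byte string."""
    try:
        data.decode("ascii")
        # Pure ASCII reads the same in UTF-8 and ISO-8859-1: nothing to decide.
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Decoding is strict by default and raises on malformed sequences.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback
```

It does not return correct results in all cases: the ISO-8859-1 bytes `C3 A5` (the two characters "Ã¥") are also a valid UTF-8 sequence for "å", so such a file would be reported as UTF-8.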
Posted by dmitri


Yes, yes, let's forget about BOMs for the moment. The problem is, as you can well understand, that I have a lot of files that must have their encoding set. I'd rather not open every single one of them from the file menu. Where is this information kept? Perhaps the encoding information that PHPEd uses can be updated directly.

And as to heuristics; I would say this is a pretty good indicator, don't you agree?

<?xml version="1.0" encoding="utf-8"?>
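
Reading that declaration takes very little code. A rough sketch in Python, assuming the file starts with the declaration (optionally after a UTF-8 BOM) and uses an ASCII-compatible encoding; the name `xml_declared_encoding` is made up for this example:

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the data. Only works for ASCII-compatible encodings.
_XML_DECL = re.compile(
    rb"""^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_declared_encoding(data: bytes):
    """Return the encoding named in a leading XML declaration, or None."""
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]  # skip a UTF-8 BOM if one is present
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else None
```

For anything that does not begin with an XML declaration, such as a plain PHP file, it simply returns `None`.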
Posted by svenax
Site Admin

fileenc.cfg :) contains all the encodings for files you opened. This file is XML, and you may edit it directly.

Quote:
And as to heuristics; I would say this is a pretty good indicator, don't you agree?
<?xml version="1.0" encoding="utf-8"?>

No doubt :) but in the case of PHP it would never work :)
Posted by dmitri


ddmitrie wrote:
fileenc.cfg :) contains all the encodings for files you opened. This file is XML, and you may edit it directly.

Good. I'll have a look.

ddmitrie wrote:
Quote:
And as to heuristics; I would say this is a pretty good indicator, don't you agree?
<?xml version="1.0" encoding="utf-8"?>

No doubt :) but in the case of PHP it would never work :)

No, but PhpED can edit other file types too, can't it?
Posted by svenax
Site Admin

We released build 4510.
It now recognizes a BOM and the declared encoding for XML and HTML :)
Thanks for pointing out the problem.
Posted by dmitri