All times are UTC - 7 hours





PostPosted: Fri Dec 16, 2005 1:18 am 
Joined: Sun Sep 19, 2004 10:59 pm
Posts: 1389
The English language pack has been incorrectly reverted to use the character set ISO-8859-1; as a result, posts made by people using other language packs will appear garbled (where non-English characters are being used).

_________________
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.


PostPosted: Wed Dec 28, 2005 5:58 pm 
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
This one's for Quietust:

Good fucking lord MySQL has way too many places to change character set encoding / translation.

I think there's still some leftover crap dealing with UTF-8 migration. The database and tables were converted to UTF-8 long ago like I mentioned, but I believe the character set encoding used between the MySQL client (i.e. the webserver) and the MySQL server (on another box) is still latin1.

According to one of the user comments on this MySQL documentation page, the client/server encoding can screw up UTF-8 as well:

http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html

While using the nesdev_phpbb database, SHOW VARIABLES returns the following (I'm using the standard mysql client on the webserver, which I don't have set to use utf8):

Code:
| character_set_client            | latin1                                                     |
| character_set_connection        | latin1                                                     |
| character_set_database          | utf8                                                       |
| character_set_results           | latin1                                                     |
| character_set_server            | latin1                                                     |
| character_set_system            | utf8                                                       |


I think part of the solution is to add these queries to the MySQL connection code in phpBB. I believe the CHARACTER_SET one is already in use (I think I added it myself), but I don't think the NAMES one is:

Code:
SET NAMES utf8;
SET CHARACTER_SET utf8;


PostPosted: Wed Dec 28, 2005 6:03 pm 
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
Strike that.

Looks like db/mysql4.php got overwritten to deal with some changes back in November (Memblers mentioned this). mysql4.php was one of the scripts I had to change to explicitly request a different character set when initiating a mysql connection.

I made a backup of the original, renaming it to mysql4.php__not_utf8_do_not_use:

Code:
-rw-r--r--  1 memblers  users  6066 Nov 16 18:56 mysql.php
-rw-r--r--  1 memblers  users  6486 Nov 16 18:56 mysql4.php
-rw-------  1 memblers  users  6482 Jul 17  2004 mysql4.php__not_utf8_do_not_use


However, a diff between the original non-utf8 and the currently used one (mysql4.php) shows absolutely no character set encoding requests or anything (meaning my changes got wiped out):

Code:
--- mysql4.php  Wed Nov 16 18:56:14 2005
+++ mysql4.php__not_utf8_do_not_use     Sat Jul 17 08:58:20 2004
@@ -6,7 +6,7 @@
  *   copyright            : (C) 2001 The phpBB Group
  *   email                : support@phpbb.com
  *
- *   $Id: mysql4.php,v 1.5.2.1 2005/09/18 16:17:20 acydburn Exp $
+ *   $Id: mysql4.php,v 1.5 2002/04/02 21:13:47 the_systech Exp $
  *
  ***************************************************************************/

@@ -271,7 +271,7 @@
                                {
                                        if( $this->rowset[$query_id] )
                                        {
-                                               $result = $this->rowset[$query_id][0][$field];
+                                               $result = $this->rowset[$query_id][$field];
                                        }
                                        else if( $this->row[$query_id] )
                                        {


So, my guess is that this is what's causing the problem.

I'll edit mysql4.php momentarily to put the UTF-8 stuff back in. Argh. :-)


PostPosted: Wed Dec 28, 2005 6:19 pm 
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
Okay, so here's the diff (in case we go through this again in the future):

Code:
--- mysql4.php__not_utf8_do_not_use     Wed Nov 16 18:56:13 2005
+++ mysql4.php  Wed Dec 28 17:14:47 2005
@@ -61,6 +61,13 @@
                                }
                        }

+                       /**
+                        * Custom hack for utf8 support; we need to tell the MySQL server, before
+                        * any data is sent/received, to do everything using utf8.
+                        */
+                       @mysql_query("SET NAMES 'utf8'");
+                       @mysql_query("SET CHARACTER_SET utf8");
+
                        return $this->db_connect_id;
                }
                else


This has one major drawback in regards to the board right now though:

Many posts between November and now were written in UTF-8 here on the forums, but were essentially submitted to the MySQL database over a latin1 connection.

The above change (made today) should fix this and revert things to how they should have been, but it definitely breaks posts from the past 2 months that contain non-Latin characters.

If I remove the SET NAMES clause, the posts between November and present appear to work correctly. Sounds great, I know, but the problem is that without that clause the client<->server connection communicates everything in latin1 (even though the actual encoding of the content is utf8). I believe there's conversion going on (i.e. the utf8 bytes are treated as latin1 on the MySQL connection, then converted to utf8 again when the data is actually stored in the database).

SET NAMES 'utf8' basically does this:

Code:
mysql> SET character_set_client = utf8;
mysql> SET character_set_results = utf8;
mysql> SET character_set_connection = utf8;


I'll leave this one up to Quietust to decide.


PostPosted: Wed Dec 28, 2005 6:24 pm 
Joined: Sun Sep 19, 2004 10:59 pm
Posts: 1389
Unfortunately, this has wiped out nearly every single post in the (now reasonably active) FCDev forum. Do you think you could turn this off temporarily so the existing posts can be saved, and then turn it back on so the posts can be fixed?


Personally, I think it's best to just leave the database with no encoding at all - just let it store the data as if it were binary. Let the software running on the site (phpBB, the wiki, etc.) and the users (namely, their browsers) decide how to interpret it.

_________________
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.


PostPosted: Wed Dec 28, 2005 6:33 pm 
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
Quietust wrote:
Unfortunately, this has wiped out nearly every single post in the (now reasonably active) FCDev forum. Do you think you could turn this off temporarily so the existing posts can be saved, and then turn it back on so the posts can be fixed?


Sure thing, I'll revert the SET NAMES entry in a moment.

I wish I knew of a way to "convert" the existing data into what would work with SET NAMES, you know what I'm saying? That way I could convert all of the older posts to what presently works, without breaking things. Sadly I don't know of a way to do that. Maybe ALTER can do it, but I'm still not sure how to accomplish that.

Quote:
Personally, I think it's best to just leave the database with no encoding at all - just let it store the data as if it were binary. Let the software running on the site (phpBB, the wiki, etc.) and the users (namely, their browsers) decide how to interpret it.


It doesn't seem to work that way. There's a lot of pieces to the puzzle:

* Database character encoding
* Table character encoding
* Column character encoding (not used in this case though)
* Client-server character encoding
* Conversion (and collation) between all of the above (e.g. utf8 converted to latin1, etc.)
* Browser character encoding
* The character encoding type specified in the HTTP header

What a mess.

At this point, anything that relies on MySQL requires that you set the character encoding. All of this was introduced in 4.1, and now that 5.0 is the official stable release, I expect to see it used even more prominently.

My personal view is that people (i.e. forum software authors) need to stop mucking around with "support for multiple languages" (by having absurd "packages" for different languages/character sets, etc.). They need to just use utf8 and solve the problem in one swoop.


PostPosted: Wed Dec 28, 2005 6:39 pm 
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
I've gone ahead and reverted the SET NAMES addition, but kept SET CHARACTER_SET.

The posts on the FCdev forum look correct to me. I don't have Japanese installed, so someone will have to check to be sure.

I can see the encoding difference between SET NAMES vs. without SET NAMES, and without SET NAMES the characters look to be the same as they were when we lacked SET CHARACTER_SET.

Might want to make some test posts in the Test forum to be 100% sure.


PostPosted: Wed Dec 28, 2005 6:45 pm 
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
Ugh, and now I've just read we'll (possibly) be going through even more pain when I get around to upgrading to MySQL 5.0...

http://dev.mysql.com/doc/refman/4.1/en/charset-upgrading.html

But there's hope in some way. In regards to my idea of converting the presently-existing posts from the older format to what presently is correct/works, this might help:

http://dev.mysql.com/doc/refman/4.1/en/charset-conversion.html
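
For reference, the same kind of repair can be sketched outside the database. The Python below undoes one round of "UTF-8 bytes read as latin1" damage on a single string. It is a heuristic sketch, not a tested migration: `repair_double_encoded` is a hypothetical name, Python's `latin-1` codec is assumed to be close enough to MySQL's latin1, and text that is legitimately made of accented Latin characters could be misdetected, so any real run should be checked against a backup.

```python
def repair_double_encoded(text):
    """Undo one round of UTF-8-read-as-latin1 damage, if that is what it is.

    The text is returned unchanged when it cannot have come from that
    mistake: either it contains characters outside latin1, or its latin1
    bytes are not valid UTF-8.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    # Build a damaged string the same way the latin1 connection would have.
    damaged = "日本語".encode("utf-8").decode("latin-1")
    print(repair_double_encoded(damaged))
    # Already-correct text passes through untouched.
    print(repair_double_encoded("日本語"))
```

Pure ASCII round-trips unchanged, so only rows that actually contain double-encoded characters would be altered.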


PostPosted: Wed Dec 28, 2005 7:01 pm 
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19104
Location: NE Indiana, USA (NTSC)
koitsu wrote:
My personal view is that people (i.e. forum software authors) need to stop mucking around with "support for multiple languages" (by having absurd "packages" for different languages/character sets, etc.). They need to just use utf8 and solve the problem in one swoop.

Though there isn't really much of a reason for "packages" for languages in which text is written, there is still a reason for "packages" for languages in which to display the interface (e.g. "Post a reply").


PostPosted: Wed Dec 28, 2005 7:02 pm 
Joined: Sun Sep 19, 2004 10:59 pm
Posts: 1389
I've grabbed the text for all of the Japanese posts in the FCdev forum (as well as one in nesemdev), so you can re-add the SET NAMES if you so desire.

Fixing the posts may be a bit troublesome, though, since they'll have to be done manually; fortunately, there are only a dozen topics that need to be fixed. Temporarily commenting out the "$edited_sql = ..." line in functions_post.php (should be line 267) will allow the posts to be edited (by a forum moderator) without inserting/updating the "Last edited [date], [N] edits total" at the bottom of each post.

_________________
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.


PostPosted: Wed Dec 28, 2005 7:09 pm 
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
Quietust wrote:
I've grabbed the text for all of the Japanese posts in the FCdev forum (as well as one in nesemdev), so you can re-add the SET NAMES if you so desire.


Re-enabled.

Quote:
Fixing the posts may be a bit troublesome, though, since they'll have to be done manually; fortunately, there are only a dozen topics that need to be fixed. Temporarily commenting out the "$edited_sql = ..." line in functions_post.php (should be line 267) will allow the posts to be edited (by a forum moderator) without inserting/updating the "Last edited [date], [N] edits total" at the bottom of each post.


Makes sense. I should add you as a forum moderator and let you take care of it (otherwise I can get to it later tonight); let me know. :-)


PostPosted: Wed Dec 28, 2005 7:12 pm 
Joined: Sun Sep 19, 2004 10:59 pm
Posts: 1389
koitsu wrote:
I should add you as a forum moderator and let you take care of it


That'll probably work better - since you said you don't have Japanese fonts installed, you might not be able to tell if they were fixed properly.

_________________
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.


PostPosted: Wed Dec 28, 2005 7:13 pm 
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
tepples wrote:
Though there isn't really much of a reason for "packages" for languages in which text is written, there is still a reason for "packages" for languages in which to display the interface (e.g. "Post a reply").


True. I didn't think about that.

Though, admittedly, maintaining a multi-language interface is something everyone's already done (by this I mean the code/framework + all the necessary files are there). The problem is that all of the data is written in a non-Unicode character set, so it wouldn't end up displaying right under utf8 anyways.

Apache has modules for handling stuff like this: mod_negotiation (and the MultiViews option). Based on browser preferences sent in HTTP headers (e.g. Accept-Language), you can determine what language someone prefers. Compare this to, say, a forum which has a dropdown for what "language" they want the interface in.

Not everyone uses Apache, but this would provide what people want for the most part.

mod_negotiation: http://httpd.apache.org/docs/2.0/mod/mod_negotiation.html
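
As a rough illustration of the header-based approach, here is a Python sketch that picks an interface language from an Accept-Language header. It is a simplification and not how mod_negotiation itself is implemented: wildcards and the finer matching rules are ignored, and `parse_accept_language` and `pick_language` are hypothetical names.

```python
def parse_accept_language(header):
    """Return (language, q) pairs from an Accept-Language header, best first."""
    choices = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        language = parts[0].lower()
        if not language:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q as "not acceptable"
        choices.append((language, quality))
    # sorted() is stable, so equal q-values keep their order from the header.
    return sorted(choices, key=lambda choice: choice[1], reverse=True)


def pick_language(header, available, default="en"):
    """Pick the first acceptable language that has an interface translation."""
    for language, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        primary = language.split("-")[0]  # e.g. "en-us" falls back to "en"
        for candidate in (language, primary):
            if candidate in available:
                return candidate
    return default


if __name__ == "__main__":
    print(pick_language("ja,en-US;q=0.8,en;q=0.5", {"en", "ja", "de"}))
```

A forum could use something like this for the default interface language and still keep the dropdown as an override.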


PostPosted: Wed Dec 28, 2005 7:17 pm 
Joined: Sun Sep 19, 2004 10:59 pm
Posts: 1389
The biggest oxymoron in the phpBB Language Pack system is that while each language specifies a different character set, the forum allows multiple language packs to be installed at once, which results in horrible conflicts when messages get posted.

The obvious solutions would be to either restrict the forum to only allow one language pack at once (which is rather stupid) OR force them all to use the same character set (i.e. Unicode, preferably UTF-8). The phpBB Team is aware of this problem, and it looks like they might fix it in the next major release, but I'm not holding my breath.

_________________
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.


PostPosted: Wed Dec 28, 2005 7:18 pm 
Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
Quietust wrote:
That'll probably work better - since you said you don't have Japanese fonts installed, you might not be able to tell if they were fixed properly.


Done.

I'll coordinate with you on IRC (Freenode) in regards to the post editing stuff and the like.

