SOLVED - The MeCab/C# problem, more coherently.: snarp

Jun 14, 2011 17:29

Before I wade into stackoverflow with this, any suggestions? I'm trying to use the Japanese morphological analyzer MeCab in a C# program (Visual Studio 2010 Express, Windows 7), and something's going wrong with the encoding.

If my input (pasted into a textbox) is this:

一方、広義の「ネコ」は、ネコ類（ネコ科動物）の一部、あるいはその全ての獣を指す包括的名称を指す。

Then my output (in another textbox) looks like this:

? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
( åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
) åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
????????????????????????? åè©ž,ã‚µå¤‰æŽ¥ç¶š,*,*,*,*,*
EOS

I assume that that's text in some other encoding being mistaken for UTF-8-encoded text. Assuming that it's EUC-JP and using Encoding.Convert to turn it into UTF-8 doesn't change the output; assuming that it's Shift-JIS and doing the same gives different gibberish. Also, while it's definitely processing the text - that's how MeCab output is supposed to be formatted - it doesn't appear to be interpreting the input as UTF-8, either. If it were doing so, there wouldn't be all those identical lines in the output starting with one-character "compounds," which it's clearly unable to identify.
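One way to check which direction the mix-up runs is to reproduce it on purpose. The sketch below is an illustration only: it assumes the machine's ANSI code page is Windows-1252 (the post doesn't say), and the class and method names are made up for the example. It decodes UTF-8 bytes as if they were Windows-1252, and separately pushes Japanese text through Windows-1252 and back. The first reproduces the "ã‚µ" seen in the output above, which would suggest MeCab is emitting UTF-8 that is then read as ANSI; the second turns each character into "?", which would fit the input being squeezed through the ANSI code page on its way in.

```csharp
using System;
using System.Text;

// Hypothetical demo class; not part of MeCab or of the program in the post.
static class MojibakeDemo
{
    public static readonly Encoding Cp1252 = LoadCp1252();

    static Encoding LoadCp1252()
    {
        // Needed on .NET Core / .NET 5+. On the .NET Framework the Windows
        // code pages are built in and this line can be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Take the UTF-8 bytes of s and decode them as Windows-1252.
    public static string Utf8ReadAs1252(string s)
    {
        return Cp1252.GetString(Encoding.UTF8.GetBytes(s));
    }

    // Encode s as Windows-1252 and decode it again; characters the code
    // page cannot represent are replaced during encoding.
    public static string ThroughAnsi(string s)
    {
        return Cp1252.GetString(Cp1252.GetBytes(s));
    }

    static void Main()
    {
        Console.WriteLine(Utf8ReadAs1252("サ"));  // ã‚µ
        Console.WriteLine(ThroughAnsi("ネコ"));   // ??
    }
}
```

If both effects are in play, re-encoding the already-damaged strings afterwards cannot recover the text, because the "?" characters no longer carry the original bytes.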

I get yet another different-looking set of gibberish when I run the sentence through MeCab's command line. But, again, it's just a row of single question marks and parentheses going down the left, so the problem isn't simply that the Windows command line doesn't support fonts with Japanese characters; again, it's just not reading the input in as UTF-8. (I did install MeCab in UTF-8 mode.)

The relevant parts of the code look like this:

[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static IntPtr mecab_new2(string arg);
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
[return: MarshalAs(UnmanagedType.AnsiBStr)]
private extern static string mecab_sparse_tostr(IntPtr m, string str);
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static void mecab_destroy(IntPtr m);

private string meCabParse(string jpnText)
{
    IntPtr mecab = mecab_new2("");
    string parsedText = mecab_sparse_tostr(mecab, jpnText);

    mecab_destroy(mecab);
    return parsedText;
}

This is how I've been doing the conversion:

// 65001 = UTF-8 codepage, 20932 = EUC-JP codepage
private string convertEncoding(string sourceString, int sourceCodepage, int targetCodepage)
{
    Encoding sourceEncoding = Encoding.GetEncoding(sourceCodepage);
    Encoding targetEncoding = Encoding.GetEncoding(targetCodepage);

    // convert source string into byte array
    byte[] sourceBytes = sourceEncoding.GetBytes(sourceString);

    // convert those bytes into target encoding
    byte[] targetBytes = Encoding.Convert(sourceEncoding, targetEncoding, sourceBytes);

    // byte array to char array
    char[] targetChars = new char[targetEncoding.GetCharCount(targetBytes, 0, targetBytes.Length)];

    // char array to target-encoded string
    targetEncoding.GetChars(targetBytes, 0, targetBytes.Length, targetChars, 0);
    string targetString = new string(targetChars);

    return targetString;
}

private string meCabParse(string jpnText)
{
    // convert the text in the string from UTF-8 to EUC-JP
    jpnText = convertEncoding(jpnText, 65001, 20932);

    IntPtr mecab = mecab_new2("");
    string parsedText = mecab_sparse_tostr(mecab, jpnText);

    // annnd convert back to UTF-8
    parsedText = convertEncoding(parsedText, 20932, 65001);

    mecab_destroy(mecab);
    return parsedText;
}
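A possible reason the EUC-JP attempt changes nothing: a .NET string is always UTF-16 in memory, so it has no "UTF-8" or "EUC-JP" state of its own. Encoding it to bytes, converting those bytes, and then decoding them with the target encoding lands back on the same characters whenever the target encoding can represent them. The sketch below illustrates that under stated assumptions: the class name and sample text are made up, and code page 20932 is assumed to be available on the machine.

```csharp
using System;
using System.Text;

// Hypothetical demo class showing the same steps as convertEncoding.
static class RoundTripDemo
{
    public static readonly Encoding EucJp = LoadEucJp();

    static Encoding LoadEucJp()
    {
        // Needed on .NET Core / .NET 5+. On the .NET Framework the code
        // pages are built in and this line can be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(20932);
    }

    // string -> UTF-8 bytes -> target-encoded bytes -> string
    public static string RoundTrip(string s, Encoding target)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(s);
        byte[] converted = Encoding.Convert(Encoding.UTF8, target, utf8);
        return target.GetString(converted);
    }

    static void Main()
    {
        string original = "ネコ科動物";
        string result = RoundTrip(original, EucJp);

        // Prints whether the "converted" string equals the original.
        Console.WriteLine(result == original);
    }
}
```

The bytes only differ while they are in the byte array. Once they are decoded back into a string, any encoding choice is gone, and the P/Invoke marshaler applies its own conversion when the string crosses into libmecab.dll.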

Suggestions/taunts?

-

Solved! Thank you, Cryovat and Tim Gebhardt!
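The fix itself isn't written up here. What follows is only a sketch of one approach that fits the symptoms, not necessarily what Cryovat and Tim Gebhardt suggested. It assumes the default P/Invoke string marshaling (which goes through the system ANSI code page) is the culprit, and sidesteps it by passing the input as a null-terminated UTF-8 byte array and reading the result back from a raw pointer. The class and helper names (MeCabUtf8, ToUtf8Z, FromUtf8Z) are invented for the example; the three mecab_* functions are the same ones declared above.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical wrapper; assumes libmecab.dll was installed in UTF-8 mode.
static class MeCabUtf8
{
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr mecab_new2(string arg);

    // Input goes in as raw bytes and the result comes back as a raw pointer,
    // so the marshaler never applies the ANSI code page to either one.
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr mecab_sparse_tostr(IntPtr m, byte[] str);

    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern void mecab_destroy(IntPtr m);

    // Encode a string as UTF-8 with a trailing zero byte, as a C string.
    public static byte[] ToUtf8Z(string s)
    {
        byte[] bytes = new byte[Encoding.UTF8.GetByteCount(s) + 1];
        Encoding.UTF8.GetBytes(s, 0, s.Length, bytes, 0);
        return bytes; // the extra last byte is already 0
    }

    // Read a zero-terminated UTF-8 C string from unmanaged memory.
    public static string FromUtf8Z(IntPtr p)
    {
        if (p == IntPtr.Zero) return null;
        int len = 0;
        while (Marshal.ReadByte(p, len) != 0) len++;
        byte[] buf = new byte[len];
        Marshal.Copy(p, buf, 0, len);
        return Encoding.UTF8.GetString(buf);
    }

    public static string Parse(string jpnText)
    {
        IntPtr mecab = mecab_new2("");
        if (mecab == IntPtr.Zero)
            throw new InvalidOperationException("mecab_new2 failed");
        try
        {
            // Copy the result into a managed string before destroying the
            // tagger, since the returned buffer belongs to it.
            IntPtr result = mecab_sparse_tostr(mecab, ToUtf8Z(jpnText));
            return FromUtf8Z(result);
        }
        finally
        {
            mecab_destroy(mecab);
        }
    }
}
```

With something like this in place, meCabParse would reduce to a call to MeCabUtf8.Parse(jpnText), and the convertEncoding step would no longer be needed.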


coding, japanese