[personal profile] snarp
Before I wade into Stack Overflow with this, any suggestions? I'm trying to use the Japanese morphological analyzer MeCab in a C# program (Visual Studio 2010 Express, Windows 7), and something's going wrong with the encoding.

If my input (pasted into a textbox) is this:

一方、広義の「ネコ」は、ネコ類(ネコ科動物)の一部、あるいはその全ての獣を指す包括的名称を指す。


Then my output (in another textbox) looks like this:

?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
(	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
)	名詞,サ変接続,*,*,*,*,*
?	名詞,サ変接続,*,*,*,*,*
?????????????????????????	名詞,サ変接続,*,*,*,*,*
EOS


I assume that's text in some other encoding being mistaken for UTF-8. Treating it as EUC-JP and using Encoding.Convert to turn it into UTF-8 doesn't change the output; treating it as Shift-JIS and doing the same gives different gibberish. Also, while MeCab is definitely processing the text (that's how its output is supposed to be formatted), it doesn't appear to be interpreting the input as UTF-8, either. If it were, there wouldn't be all those identical lines in the output starting with one-character "compounds," which it's clearly unable to identify.

I get yet another different-looking set of gibberish when I run the sentence through MeCab's command line. But, again, it's just a row of single question marks and parentheses going down the left, so the problem isn't only that the Windows command line doesn't support fonts with Japanese characters; it's still not reading the input in as UTF-8. (I did install MeCab in UTF-8 mode.)

The relevant parts of the code look like this:

[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static IntPtr mecab_new2(string arg);
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
[return: MarshalAs(UnmanagedType.AnsiBStr)]
private extern static string mecab_sparse_tostr(IntPtr m, string str);
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static void mecab_destroy(IntPtr m);

private string meCabParse(string jpnText)
{
	IntPtr mecab = mecab_new2("");
	string parsedText = mecab_sparse_tostr(mecab, jpnText);
	
	mecab_destroy(mecab);
	return parsedText;
}


This is how I've been doing the conversion:

// 65001 = UTF-8 codepage, 20932 = EUC-JP codepage
private string convertEncoding(string sourceString, int sourceCodepage, int targetCodepage)
{
	Encoding sourceEncoding = Encoding.GetEncoding(sourceCodepage); 
	Encoding targetEncoding = Encoding.GetEncoding(targetCodepage);

	// convert source string into byte array
	byte[] sourceBytes = sourceEncoding.GetBytes(sourceString);

	// convert those bytes into target encoding
	byte[] targetBytes = Encoding.Convert(sourceEncoding, targetEncoding, sourceBytes);

	// byte array to char array
	char[] targetChars = new char[targetEncoding.GetCharCount(targetBytes, 0, targetBytes.Length)];

	// char array to target-encoded string
	targetEncoding.GetChars(targetBytes, 0, targetBytes.Length, targetChars, 0);
	string targetString = new string(targetChars);

	return targetString;
}

private string meCabParse(string jpnText)
{
	// convert the text from UTF-8 to EUC-JP
	jpnText = convertEncoding(jpnText, 65001, 20932);

	IntPtr mecab = mecab_new2("");
	string parsedText = mecab_sparse_tostr(mecab, jpnText);

	// annnd convert back to UTF-8
	parsedText = convertEncoding(parsedText, 20932, 65001);

	mecab_destroy(mecab);
	return parsedText;
}
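
A likely reason the conversion above "doesn't change the output": a .NET string is always UTF-16 in memory, so encoding it to bytes, converting those bytes, and decoding them again hands back the same string. The sketch below shows that round trip in isolation. RoundTripDemo and RoundTrip are made-up names, and UTF-16 stands in for EUC-JP here only because code page 20932 may not be available on every runtime without extra setup.

```csharp
using System;
using System.Text;

static class RoundTripDemo
{
    // Same idea as convertEncoding above: encode, convert the bytes, decode again.
    public static string RoundTrip(string s, Encoding source, Encoding target)
    {
        byte[] sourceBytes = source.GetBytes(s);
        byte[] targetBytes = Encoding.Convert(source, target, sourceBytes);
        return target.GetString(targetBytes);
    }

    public static void Main()
    {
        string text = "広義の「ネコ」";
        string result = RoundTrip(text, Encoding.UTF8, Encoding.Unicode);

        // The string comes back unchanged; only the intermediate byte arrays differ.
        Console.WriteLine(result == text);                         // True
        Console.WriteLine(Encoding.UTF8.GetBytes(text).Length);    // 21
        Console.WriteLine(Encoding.Unicode.GetBytes(text).Length); // 14
    }
}
```

If that reading is right, the bytes MeCab actually receives are decided by the P/Invoke marshaler rather than by anything done to the string beforehand.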


Suggestions/taunts?

-

Solved! Thank you, Cryovat and Tim Gebhardt!
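
The post doesn't record what the fix was. For anyone landing here with the same symptoms, one approach that is commonly used for this kind of problem is to skip string marshaling entirely: pass NUL-terminated UTF-8 bytes in, and decode the returned pointer as UTF-8 by hand. This is only a sketch under the assumption that MeCab's dictionary really is UTF-8; the class MeCabUtf8 and the helpers ToUtf8Z and FromUtf8Z are invented for illustration, and only the three mecab_* exports come from the code above.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class MeCabUtf8
{
    // Same exports as in the post, declared with raw bytes and pointers
    // so that no ANSI code page conversion takes place.
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr mecab_new2(byte[] arg);

    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr mecab_sparse_tostr(IntPtr m, byte[] str);

    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern void mecab_destroy(IntPtr m);

    // Encode a string as NUL-terminated UTF-8 (hypothetical helper).
    public static byte[] ToUtf8Z(string s)
    {
        byte[] bytes = new byte[Encoding.UTF8.GetByteCount(s) + 1];
        Encoding.UTF8.GetBytes(s, 0, s.Length, bytes, 0);
        return bytes; // the extra final byte is already 0
    }

    // Decode a NUL-terminated UTF-8 buffer owned by native code (hypothetical helper).
    public static string FromUtf8Z(IntPtr p)
    {
        if (p == IntPtr.Zero) return null;
        int len = 0;
        while (Marshal.ReadByte(p, len) != 0) len++;
        byte[] buf = new byte[len];
        Marshal.Copy(p, buf, 0, len);
        return Encoding.UTF8.GetString(buf);
    }

    public static string Parse(string jpnText)
    {
        IntPtr mecab = mecab_new2(ToUtf8Z(""));
        if (mecab == IntPtr.Zero)
            throw new InvalidOperationException("mecab_new2 failed");
        try
        {
            // Decode before mecab_destroy: the result buffer belongs to the MeCab instance.
            return FromUtf8Z(mecab_sparse_tostr(mecab, ToUtf8Z(jpnText)));
        }
        finally
        {
            mecab_destroy(mecab);
        }
    }
}
```

Calling MeCabUtf8.Parse(textBox.Text) would then replace meCabParse. Returning IntPtr instead of string also keeps the marshaler from trying to free memory that, as far as I know, MeCab still owns.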
