Dino's Sandbox: Getting Stroke Counts for Traditional Chinese Characters (Unicode)

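The previous post obtained the stroke count of a Chinese character with a table lookup partitioned by Big5 code ranges. Rare characters that fall outside the Big5 character set could not be handled at all, an inherent limitation that made the solution unpleasant to use and left me unsatisfied with it.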

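So I kept looking for a workable approach and found the Unihan database (the Unicode Han Database). Its data is very rich: Cangjie codes, semantic variants, Zhuyin, total stroke counts, radical stroke counts, and more (I have not worked through all of it, so this list is incomplete). I will write up more of what can be done with it another day; for now I only take the part I need, the stroke count per character. A quick fix for now :)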

The Unihan database is distributed as several plain-text files. The stroke information I need is stored in Unihan_DictionaryLikeData.txt, and the file format is easy to parse:

U+3400 kCangjie TM
U+3400 kTotalStrokes 5
U+3401 kCangjie MOW
U+3401 kCihaiT 37.103
U+3401 kTotalStrokes 6
U+3402 kCangjie PPP
U+3402 kTotalStrokes 6

As the listing shows, each line is tab-separated and carries three fields: the code point, the property name, and the property value.
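The three-field layout can be sketched as a small parser. This is only an illustration: the class name `UnihanLine` and the `TryParse` signature are my own assumptions, not part of the original solution, and the split accepts spaces as well as tabs so that it also works on the listing as displayed above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of the original solution).
public static class UnihanLine
{
    // Splits one data line into code point, property name, and raw value.
    // Returns false for blank lines, comments, and malformed lines.
    public static bool TryParse(string line, out int codePoint, out string property, out string value)
    {
        codePoint = 0;
        property = null;
        value = null;

        if (string.IsNullOrEmpty(line) || !line.StartsWith("U+", StringComparison.Ordinal))
        {
            return false;
        }

        // At most three fields: the value itself may contain spaces.
        var fields = line.Split(new[] { '\t', ' ' }, 3, StringSplitOptions.RemoveEmptyEntries);
        if (fields.Length < 3)
        {
            return false;
        }

        // "U+3400" -> 0x3400
        if (!int.TryParse(fields[0].Substring(2), NumberStyles.AllowHexSpecifier,
                          CultureInfo.InvariantCulture, out codePoint))
        {
            return false;
        }

        property = fields[1];
        value = fields[2];
        return true;
    }
}
```

For the line `U+3401 kTotalStrokes 6`, this yields the code point 0x3401, the property `kTotalStrokes`, and the value `6`.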

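Since this is meant to be a component that only maps characters to stroke counts, I did not want to bring in a real database yet. Instead, I parse Unihan_DictionaryLikeData.txt, store the stroke counts in a stream, and use the Unicode code point value as the offset into that stream (a table lookup) to fetch the stroke count quickly. The main code follows: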

StrokeLookup.cs

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Reflection;

namespace Unihan
{
    public class StrokeLookup : IDisposable
    {
        // Lazy<T> makes the singleton initialization thread-safe
        private static readonly Lazy<StrokeLookup> _instance =
            new Lazy<StrokeLookup>(() => new StrokeLookup());

        public static StrokeLookup Instance
        {
            get { return _instance.Value; }
        }

        // Stroke counts are kept in a stream; a character's code point is the offset of its stroke count
        private Stream _stream;

        private StrokeLookup()
        {
            InitialLookupTable();
        }

        private void InitialLookupTable()
        {
            var binPath = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
            var dataPath = Path.Combine(binPath, "Unihan.Data");
            var filePath = Path.Combine(dataPath, "Unihan_DictionaryLikeData.txt");
            var lookupPath = Path.Combine(dataPath, "Unihan_DictionaryLikeData.strokes");

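            // Regenerate the lookup file if it has never been built or if the Unihan data in Unihan.Data is newer
            // Detecting changes to the source file with a hash code would be more appropriate here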
            if (!File.Exists(lookupPath) || File.GetLastWriteTime(filePath) > File.GetLastWriteTime(lookupPath))
            {
                using (var stream = new FileStream(lookupPath, FileMode.Create, FileAccess.ReadWrite))
                {
                    GenerateStrokeData(filePath, stream);
                }
            }

            // Loading the lookup data into a MemoryStream instead would trade memory for faster access
            _stream = new FileStream(lookupPath, FileMode.Open, FileAccess.Read);
        }

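        private void GenerateStrokeData(string filePath, Stream outputStream)
        {
            using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
            {
                using (var reader = new StreamReader(stream))
                {
                    string line;

                    while ((line = reader.ReadLine()) != null)
                    {
                        // Skip blank lines, comments, and anything else that is not a data line
                        if (!line.StartsWith("U+", StringComparison.Ordinal))
                        {
                            continue;
                        }

                        // Split each line into at most three fields
                        var datas = line.Split(new[] { '\t', ' ' }, 3, StringSplitOptions.RemoveEmptyEntries);

                        // Ignore malformed lines and lines without stroke information
                        if (datas.Length < 3 || datas[1] != "kTotalStrokes")
                        {
                            continue;
                        }

                        // Convert "U+3400" to a uint
                        var hex = datas[0].Substring(2);
                        var code = uint.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);

                        // Read the stroke count. Newer Unihan releases may list two values
                        // (per UAX #38: first for zh-Hans, second for zh-Hant), so take the last one.
                        var values = datas[2].Split(new[] { '\t', ' ' }, StringSplitOptions.RemoveEmptyEntries);
                        byte stroke;
                        if (values.Length == 0 || !byte.TryParse(values[values.Length - 1], out stroke))
                        {
                            continue;
                        }

                        // Zero-fill the gap so that code points without stroke data read back as 0
                        if (outputStream.Length < code)
                        {
                            outputStream.Seek(0, SeekOrigin.End);
                            for (var gap = code - outputStream.Length; gap > 0; gap--)
                            {
                                outputStream.WriteByte(0);
                            }
                        }

                        outputStream.Seek(code, SeekOrigin.Begin);
                        outputStream.WriteByte(stroke);
                    }
                }
            }
        }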

        public IEnumerable<CharStroke> GetStrokes(string source)
        {
            foreach (var chr in source)
            {
                yield return new CharStroke { Character = chr, Stroke = GetStroke(chr) };
            }
        }

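        public int GetStroke(char source)
        {
            // A char is a single UTF-16 code unit, so only BMP characters can be looked up here
            long code = source;

            // The singleton shares one stream, so seek and read must not interleave across threads
            lock (_stream)
            {
                if (code < _stream.Length)
                {
                    _stream.Seek(code, SeekOrigin.Begin);
                    return _stream.ReadByte();
                }
            }
            return 0;
        }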

        public void Dispose()
        {
            if (_stream != null)
            {
                _stream.Dispose();
            }
        }
    }
}
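A comment in InitialLookupTable notes that detecting changes to the source file by hash would be more appropriate than comparing timestamps. One way this might look is sketched below. The class `SourceFingerprint`, its method names, and the idea of keeping the hash in a sidecar file next to the lookup file are all assumptions made for illustration; none of them exist in the original solution.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper (not part of the original solution).
public static class SourceFingerprint
{
    // Returns the SHA-256 hash of a file as an uppercase hex string.
    public static string ComputeHash(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");
        }
    }

    // True when a hash was recorded earlier and still matches the source file.
    public static bool IsUpToDate(string sourcePath, string hashPath)
    {
        if (!File.Exists(hashPath))
        {
            return false;
        }
        var recorded = File.ReadAllText(hashPath).Trim();
        return string.Equals(recorded, ComputeHash(sourcePath), StringComparison.Ordinal);
    }

    // Call this only after the lookup file has been regenerated successfully.
    public static void Record(string sourcePath, string hashPath)
    {
        File.WriteAllText(hashPath, ComputeHash(sourcePath));
    }
}
```

Recording the hash only after a successful regeneration means that a failed run is retried the next time, rather than leaving a stale lookup file marked as current.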

CharStroke.cs

namespace Unihan
{
    public class CharStroke
    {
        public char Character { get; set; }
        public int Stroke { get; set; }
    }
}
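One caveat: GetStrokes walks the string one `char` at a time, so a character outside the Basic Multilingual Plane (for example one in CJK Extension B) arrives as two surrogate code units and is never looked up by its real code point. A minimal sketch of iterating by code point instead is shown below. The class `CodePoints` is hypothetical, and using its output would also need a lookup that accepts an `int` code point, which the code above does not provide.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the original solution).
public static class CodePoints
{
    // Yields one Unicode code point per character, combining surrogate pairs.
    public static IEnumerable<int> Enumerate(string source)
    {
        for (var i = 0; i < source.Length; i++)
        {
            if (char.IsSurrogatePair(source, i))
            {
                yield return char.ConvertToUtf32(source, i);
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // BMP character (or an unpaired surrogate, passed through as-is)
                yield return source[i];
            }
        }
    }
}
```

For the string `"\u3400\U00020000A"`, this yields 0x3400, 0x20000, and 0x41, whereas iterating by `char` would produce four values.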

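Along the way I considered storing every character and its stroke count in a static Dictionary<char, byte> member of the class. That approach is not slow and is easy to write, but it consumes a lot of memory. I then switched to a byte[] indexed by a base position plus an offset, hoping to get the stroke count quickly. Memory use was still a problem, though somewhat better than with the dictionary. In the end I abandoned both approaches.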
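For comparison, the abandoned byte[] variant (base position plus offset) might have looked roughly like the sketch below. The class name `StrokeTable` and its constructor are assumptions for illustration only; the original code for this variant was not published in this post.

```csharp
using System;

// Hypothetical reconstruction of the in-memory variant (not the author's code).
public sealed class StrokeTable
{
    private readonly int _baseCodePoint;
    private readonly byte[] _strokes;

    // strokes[i] holds the stroke count of code point (baseCodePoint + i); 0 means "no data".
    public StrokeTable(int baseCodePoint, byte[] strokes)
    {
        if (strokes == null) throw new ArgumentNullException("strokes");
        _baseCodePoint = baseCodePoint;
        _strokes = strokes;
    }

    public int GetStroke(int codePoint)
    {
        var offset = codePoint - _baseCodePoint;
        return offset >= 0 && offset < _strokes.Length ? _strokes[offset] : 0;
    }
}
```

With the three sample entries from the listing, `new StrokeTable(0x3400, new byte[] { 5, 6, 6 }).GetStroke(0x3401)` returns 6. The whole table stays resident in memory, which is the cost that the stream-based lookup avoids.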

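The complete source code of the sample solution can be downloaded from here. If you run into any problems, feedback is welcome. Thanks! :)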

Tuesday, December 20, 2011