GSE.CutAll not work well for some Chinese text #6115

Open
1 task done
smoothdvd opened this issue Oct 28, 2024 · 0 comments

smoothdvd commented Oct 28, 2024

How to reproduce this bug?

query Get {
    Get {
        NewspaperArticle_V2(
            limit: 10000
            nearVector: {
                vector: [... ]
            }
            where: {
                operator: Or
                operands: [{ path: ["content"], operator: ContainsAll, valueText: ["黄海峰","刘捷"] }]
            }
        ) {
            title
            articleId
            content
        }
    }
}

In some cases, GSE.CutAll does not generate correct tokens from Chinese input text.
For example:
source: "本报讯(首席记者 赵芳洲)平安杭州建设20周年大会昨日下午召开。省委副书记、市委书记刘捷"

results in:
GSE.CutAll: [本报 本报讯 ( 首席 首席记者 记者 赵 芳 洲 ) 平安 杭州 建设 2 0 周年 大会 昨日 下午 召开 。 省委 副 书记 、 市委 市委书记 书记 刘 捷]
GSE.Cut(text, true), using DAG and HMM: [本报讯 ( 首席记者 赵芳洲 ) 平安 杭州 建设 20 周年 大会 昨日 下午 召开 。 省委 副 书记 、 市委书记 刘捷]
GSE.CutSearch(text, true), search mode using HMM: [本报 本报讯 ( 首席 记者 首席记者 赵芳洲 ) 平安 杭州 建设 20 周年 大会 昨日 下午 召开 。 省委 副 书记 、 市委 书记 市委书记 刘捷]

Some Chinese personal names and other terms are tokenized incorrectly: '刘 捷' should be '刘捷', '赵 芳 洲' should be '赵芳洲', and '2 0' should be '20'.
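
To make the difference concrete, here is a minimal Go sketch that checks which of the expected words survive as a single token in each output. It uses only the standard library and works on the token lists quoted above rather than calling gse; the names `missingWords`, `cutAllTokens`, and `cutHMMTokens` are made up for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Token lists copied from the outputs quoted in this report
// (not produced by calling gse here).
var cutAllTokens = strings.Fields("本报 本报讯 ( 首席 首席记者 记者 赵 芳 洲 ) 平安 杭州 建设 2 0 周年 大会 昨日 下午 召开 。 省委 副 书记 、 市委 市委书记 书记 刘 捷")
var cutHMMTokens = strings.Fields("本报讯 ( 首席记者 赵芳洲 ) 平安 杭州 建设 20 周年 大会 昨日 下午 召开 。 省委 副 书记 、 市委书记 刘捷")

// missingWords (hypothetical helper) returns the words from want
// that do not appear as a single token in tokens.
func missingWords(tokens, want []string) []string {
	seen := make(map[string]bool, len(tokens))
	for _, t := range tokens {
		seen[t] = true
	}
	var missing []string
	for _, w := range want {
		if !seen[w] {
			missing = append(missing, w)
		}
	}
	return missing
}

func main() {
	want := []string{"刘捷", "赵芳洲", "20"}
	fmt.Println("CutAll missing:", missingWords(cutAllTokens, want))
	// prints: CutAll missing: [刘捷 赵芳洲 20]
	fmt.Println("Cut(text, true) missing:", missingWords(cutHMMTokens, want))
	// prints: Cut(text, true) missing: []
}
```

None of the three expected words exists as a whole token in the CutAll output, while all of them are present in the Cut(text, true) output.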

The difference is visible even in the go-ego/gse example:
https://github.com/go-ego/gse/blob/627fa87efa481d4f734d6e06798363a4e1dde1d8/examples/main.go#L99C3-L99C4
There, 'imax' is tokenized as 'i m a x' by the CutAll method, which is not correct.
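
The '2 0' and 'i m a x' cases share a pattern: an ASCII word or number is split into single-character tokens. The following Go sketch is a rough diagnostic for spotting that pattern in a token list. It is a heuristic written for this report, not part of gse or Weaviate, and the names `isSingleASCIIAlnum` and `fragmentedRuns` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// isSingleASCIIAlnum reports whether tok is exactly one ASCII letter or digit.
func isSingleASCIIAlnum(tok string) bool {
	if len(tok) != 1 {
		return false
	}
	c := tok[0]
	return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
}

// fragmentedRuns returns every run of two or more consecutive
// single-character ASCII tokens, rejoined into one string.
func fragmentedRuns(tokens []string) []string {
	var runs, cur []string
	flush := func() {
		if len(cur) >= 2 {
			runs = append(runs, strings.Join(cur, ""))
		}
		cur = cur[:0]
	}
	for _, t := range tokens {
		if isSingleASCIIAlnum(t) {
			cur = append(cur, t)
			continue
		}
		flush()
	}
	flush() // handle a run that ends the token list
	return runs
}

func main() {
	// Excerpt of the CutAll output quoted in this report.
	fmt.Println(fragmentedRuns(strings.Fields("平安 杭州 建设 2 0 周年 大会")))
	// prints: [20]

	// The 'i m a x' split described for the go-ego/gse example.
	fmt.Println(fragmentedRuns([]string{"i", "m", "a", "x"}))
	// prints: [imax]
}
```

The check would also flag input that really does consist of adjacent single letters, so it is only useful for surfacing candidates to inspect.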

What is the expected behavior?

Tokens should be generated with GSE.Cut(text, true), which uses DAG and HMM, or with GSE.CutSearch(text, true), which is search mode using HMM.

What is the actual behavior?

The GSE.CutAll method generates wrong tokens.

Supporting information

No response

Server Version

1.27.0

Weaviate Setup

Single Node

Nodes count

No response


@smoothdvd smoothdvd added the bug label Oct 28, 2024