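I want to split a string, for example:
Input: Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.
Output: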
Bangalore
railway
line
Indian Railway
comes
under
Nagpur
division
Central Railway
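Note that compound nouns stay together because they are in title case.
I'm having trouble with the regex part in particular: split(/(?=\s[a-z]|[A-Z]\s|\.)/)
How can I get it to split correctly in the "water Tor Museum" case?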
export function splitByPhrase(text: string) {
  const outputFreq = text
    .split(/(?=\s[a-z]|[A-Z]\s|\.)/)
    .filter(Boolean)
    .map((x) => x.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "").trim())
    .filter((x) => !stopWords.includes(x));
  return outputFreq;
}

describe("phrases", () => {
  it("no punctuation", () => {
    expect(splitByPhrase("test. Toronto")).toEqual(["test", "Toronto"]);
  });
  it("no spaces", () => {
    expect(splitByPhrase(" test Toronto ")).toEqual(["test", "Toronto"]);
  });
  it("simple phrase detection", () => {
    expect(splitByPhrase(" water Tor Museum wants")).toEqual(["water", "Tor Museum", "wants"]);
  });
  it("remove stop words", () => {
    expect(splitByPhrase("Toronto a Museum with")).toEqual(["Toronto", "Museum"]);
  });
});
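You could add another alternative that splits on a space only when what is to its left is not an uppercase character followed by lowercase characters (a title-case word), and what is to its right is an uppercase character.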
(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))
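To see what the extra alternative changes, here is a minimal sketch that runs the question's pattern and the extended pattern on the "water Tor Museum" input. The names `original`, `extended`, `sample`, and `pieces` are made up for illustration, and it assumes a JavaScript engine with lookbehind support (ES2018 or later).

```javascript
// The question's pattern and the pattern with the extra alternative.
const original = /(?=\s[a-z]|[A-Z]\s|\.)/;
const extended = /(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))/;

const sample = " water Tor Museum wants";

// Hypothetical helper: split, trim each piece, and drop empty strings
// so that only the grouping of words is compared.
const pieces = (re) => sample.split(re).map((s) => s.trim()).filter(Boolean);

// Without the extra alternative there is no split before " Tor".
console.log(pieces(original)); // [ 'water Tor Museum', 'wants' ]

// The lookbehind allows a split before " Tor" (left side is "water"),
// but blocks one before " Museum" (left side is the title-case word "Tor").
console.log(pieces(extended)); // [ 'water', 'Tor Museum', 'wants' ]
```

The negative lookbehind is what keeps consecutive title-case words together: a space followed by an uppercase letter is only a split point when the word before it is not itself title case.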
const stopWords = [
  "of", "The", "It", "the", "a", "with"
];

function splitByPhrase(text) {
  return text
    // Split before " <lowercase>", before ".", or before " <Uppercase>"
    // when the preceding word is not title case.
    .split(/(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))/)
    // Strip punctuation and surrounding whitespace from each piece.
    .map((x) => x.replace(/[.,\/#!$%^&*;:{}=_`~()-]/g, "").trim())
    // Drop stop words, then any empty strings left over.
    .filter((x) => !stopWords.includes(x))
    .filter(Boolean);
}

[
  "Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.",
  "test. Toronto",
  " test Toronto ",
  " water Tor Museum wants",
  "Toronto a Museum with"
].forEach(i => console.log(splitByPhrase(i)));
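The snippet above only logs its results, so here is a sketch that checks the same function against the expected outputs given in the question, without needing a test runner. The function and stop-word list are restated so the block runs on its own. The `cases` and `failures` names are hypothetical, and lookbehind support (ES2018 or later) is assumed.

```javascript
// Stop words and function restated from the answer so this runs standalone.
const stopWords = ["of", "The", "It", "the", "a", "with"];

function splitByPhrase(text) {
  return text
    .split(/(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))/)
    .map((x) => x.replace(/[.,\/#!$%^&*;:{}=_`~()-]/g, "").trim())
    .filter((x) => !stopWords.includes(x))
    .filter(Boolean);
}

// Inputs and expected outputs taken from the question.
const cases = [
  ["test. Toronto", ["test", "Toronto"]],
  [" test Toronto ", ["test", "Toronto"]],
  [" water Tor Museum wants", ["water", "Tor Museum", "wants"]],
  ["Toronto a Museum with", ["Toronto", "Museum"]],
  [
    "Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.",
    ["Bangalore", "railway", "line", "Indian Railway", "comes", "under", "Nagpur", "division", "Central Railway"],
  ],
];

// Collect every case whose actual output differs from the expected one,
// comparing arrays by their JSON form.
const failures = cases.filter(
  ([input, expected]) =>
    JSON.stringify(splitByPhrase(input)) !== JSON.stringify(expected)
);

console.log(failures.length === 0 ? "all cases pass" : failures);
```

If any case fails, the log shows the input together with the output that was expected for it.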
For splitting lowercase words that come before title-case words, I think split(\s(?=[a-z][a-z]\w+\.))
will do what you need.
https://regexr.com/59jfo
Input: Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.
Output:
Bangalore
railway
line
of
the
Indian Railway.
It
comes
under
Nagpur
division
of
the
Central Railway.