Asked by: 小点点

Simple phrase detection: splitting by phrase with a regular expression


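I want to split a string such as:

Input: Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.

Output: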

Bangalore 
railway 
line
Indian Railway 
comes
under 
Nagpur 
division
Central Railway

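Note that compound nouns stay together because they are written in title case.

I'm having trouble with the regex part in particular: split(/(?=\s[a-z]|[A-Z]\s|\.)/)

How do I make it split correctly in the "water Tor Museum" scenario?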

export function splitByPhrase(text: string) {
  const outputFreq = text
    .split(/(?=\s[a-z]|[A-Z]\s|\.)/)
    .filter(Boolean)
    .map((x) => x.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "").trim())
    .filter((x) => !stopWords.includes(x));

  return outputFreq;
}

describe("phrases", () => {
  it("no punctuation", () => {
    expect(splitByPhrase("test. Toronto")).toEqual(["test", "Toronto"]);
  });
  it("no spaces", () => {
    expect(splitByPhrase(" test Toronto ")).toEqual(["test", "Toronto"]);
  });
  it("simple phrase detection", () => {
    expect(splitByPhrase(" water Tor Museum wants")).toEqual(["water", "Tor Museum", "wants"]);
  });
  it("remove stop words", () => {
    expect(splitByPhrase("Toronto a Museum with")).toEqual(["Toronto", "Museum"]);
  });
});

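2 answers

Anonymous user

You can add another alternative that splits on a space only when the text to its left is not an uppercase character followed by lowercase characters (that is, not a title-case word) and the character to its right is uppercase.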

(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))

Regex demo
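To make the difference concrete, here is a small sketch that runs the question's pattern and the pattern above on the failing input and prints the raw pieces before any trimming or stop-word filtering. The names questionPattern, answerPattern, and the three beforeX constants are only illustrative; they do not appear in the original code.

```javascript
// Pattern from the question. `[A-Z]\s` only matches a capital letter that is
// itself followed by whitespace, so it never fires in front of "Tor".
const questionPattern = /(?=\s[a-z]|[A-Z]\s|\.)/;

// Pattern from this answer, rebuilt from its three alternatives.
const beforeLowercase = / [a-z]/;                        // space + lowercase letter
const beforeDot = /\./;                                  // a literal dot
const beforeNewPhrase = /(?<!\b[A-Z][a-z]*) (?=[A-Z])/;  // space + capital, unless a title-case word ends just before the space
const answerPattern = new RegExp(
  `(?=${beforeLowercase.source}|${beforeDot.source}|${beforeNewPhrase.source})`
);

const sample = " water Tor Museum wants";

const questionPieces = sample.split(questionPattern);
const answerPieces = sample.split(answerPattern);

console.log(questionPieces); // [ ' water Tor Museum', ' wants' ]
console.log(answerPieces);   // [ ' water', ' Tor Museum', ' wants' ]
```

Because the whole pattern sits inside a lookahead, split() consumes nothing: each piece keeps its leading space or dot, which is why the cleanup step with replace() and trim() is still needed afterwards.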

const stopWords = [
  "of", "The", "It", "the", "a", "with"
];

function splitByPhrase(text) {
  return text
    .split(/(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))/)
    .map((x) => x.replace(/[.,\/#!$%^&*;:{}=_`~()-]/g, "").trim())
    .filter((x) => !stopWords.includes(x)).filter(Boolean);
}

[
  "Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.",
  "test. Toronto",
  " test Toronto ",
  " water Tor Museum wants",
  "Toronto a Museum with"
].forEach(i => console.log(splitByPhrase(i)));
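Lookbehind assertions were added in ES2018, so older engines may reject the pattern above. If that is a concern, one possible alternative (not part of the original answer) is to match phrases instead of splitting. The sketch below assumes words consist only of ASCII letters and that the words of a title-case phrase are separated by single spaces; matchPhrases is a hypothetical name.

```javascript
const stopWords = ["of", "The", "It", "the", "a", "with"];

// Hypothetical alternative: collect tokens with match() instead of split().
// A token is either a run of title-case words joined by single spaces,
// or a single lowercase word. Punctuation and digits are simply skipped.
function matchPhrases(text) {
  const phraseOrWord = /[A-Z][a-z]*(?: [A-Z][a-z]*)*|[a-z]+/g;
  const tokens = text.match(phraseOrWord) || []; // match() returns null when nothing matches
  return tokens.filter((token) => !stopWords.includes(token));
}

console.log(matchPhrases(" water Tor Museum wants")); // [ 'water', 'Tor Museum', 'wants' ]
console.log(matchPhrases("Toronto a Museum with"));   // [ 'Toronto', 'Museum' ]
```

The trade-off is that anything outside the two token shapes, such as digits or hyphenated words, is dropped silently, whereas the split-based version keeps it attached to a neighbouring piece.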

Anonymous user

For the case of splitting off lowercase words that come before title-case words, I think split(\s(?=[a-z][a-z]\w+\.)) may do what you need.

https://regexr.com/59jfo

Input: Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.

Output:

Bangalore
railway
line
of
the
Indian Railway.
It
comes
under
Nagpur
division
of
the
Central Railway.