HanLP: Han Language Processing

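A multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x, for advancing state-of-the-art deep learning techniques in both academia and industry. HanLP has been designed from day one to be efficient, easy to use, and extensible.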

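Thanks to open-access corpora such as Universal Dependencies and OntoNotes, HanLP 2.1 offers joint tasks for 104 languages: morphological analysis, dependency parsing, constituency parsing, predicate-argument structure analysis, semantic dependency parsing, and abstract meaning representation (AMR) parsing.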

For end users, HanLP offers a lightweight RESTful API and a native Python API.

RESTful APIs

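A tiny package of only a few KB, intended for agile development and mobile applications. Anonymous use is allowed, but an authentication key is recommended; it can be used free of charge under the CC BY-NC-SA 4.0 license.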

Python

pip install hanlp_restful

First, create a client with the API URL and your authentication key.

from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='mul')
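
A key written directly into source code is easy to leak. As a minimal sketch, the key could instead be read from an environment variable before the client is built. The variable name HANLP_AUTH and the helper names are assumptions made for this example; they are not part of hanlp_restful.

```python
import os

API_URL = 'https://hanlp.hankcs.com/api'  # URL taken from the example above


def client_kwargs(env=os.environ):
    """Build keyword arguments for HanLPClient from the environment.

    HANLP_AUTH is a hypothetical variable name chosen for this sketch.
    """
    return {
        'auth': env.get('HANLP_AUTH'),  # None falls back to anonymous access
        'language': 'mul',              # multilingual models, as above
    }


def make_client():
    # Imported lazily so client_kwargs() works without the package installed.
    from hanlp_restful import HanLPClient
    return HanLPClient(API_URL, **client_kwargs())
```

Calling make_client() with HANLP_AUTH unset gives the same anonymous client as the line above.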

Java

Add the following dependency to your pom.xml.

<dependency>
  <groupId>com.hankcs.hanlp.restful</groupId>
  <artifactId>hanlp-restful</artifactId>
  <version>0.0.7</version>
</dependency>

First, create a client with the API URL and your authentication key.

HanLPClient HanLP = new HanLPClient("https://hanlp.hankcs.com/api", null, "mul");

Quick Start

Whichever programming language you use, the same interface parses the text.

HanLP.parse("In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments. 2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。")

For visualization, annotation guidelines, and other details, see the documentation.
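
The parse result can be handled like a dictionary that maps task names to one list per sentence. The sketch below pairs tokens with their tags under that assumption. The keys 'tok' and 'pos' and the sample values are illustrative placeholders, since the actual keys depend on the models being served.

```python
def tagged_tokens(doc, tok_key='tok', pos_key='pos'):
    """Return one list of (token, tag) pairs per sentence."""
    return [
        list(zip(tokens, tags))
        for tokens, tags in zip(doc[tok_key], doc[pos_key])
    ]


# Hand-written stand-in for a parse result; not real HanLP output.
sample = {
    'tok': [['HanLP', 'parses', 'text', '.']],
    'pos': [['PROPN', 'VERB', 'NOUN', 'PUNCT']],
}

for sentence in tagged_tokens(sample):
    print(sentence)
```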

Native APIs

pip install hanlp

HanLP requires Python 3.6 or later. A GPU/TPU is recommended but not required.
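
A quick pre-flight check of these requirements might look like the following sketch. The helper names are made up for this example, and the GPU check assumes the PyTorch backend: torch.cuda.is_available() is a PyTorch call, not a HanLP one.

```python
import sys


def python_ok(version=sys.version_info):
    """True if the interpreter is Python 3.6 or later."""
    return tuple(version[:2]) >= (3, 6)


def gpu_available():
    """True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


if not python_ok():
    print('HanLP needs Python 3.6 or later.')
```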

Quick Start

import hanlp
HanLP = hanlp.load(hanlp.pretrained.mtl.NPCMJ_UD_KYOTO_TOK_POS_CON_BERT_BASE_CHAR_JA)
print(HanLP(['2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。',
             '奈須きのこは1973年11月28日に千葉県円空山で生まれ、ゲーム制作会社「ノーツ」の設立者だ。',]))

In particular, the Python HanLPClient can also be used as a callable function that follows the same semantics. For visualization, annotation guidelines, and other details, see the documentation.

Train Your Own Models

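Writing a DL model is not hard. The hard part is writing a model that reproduces the scores reported in papers. The snippet below shows how to surpass the state-of-the-art tokenizer in 6 minutes.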

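# Import paths below follow the HanLP 2.1 source layout; adjust them if your version differs.
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
from hanlp.datasets.tokenization.sighan2005.pku import SIGHAN2005_PKU_TRAIN_ALL, SIGHAN2005_PKU_TEST

tokenizer = TransformerTaggingTokenizer()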
save_dir = 'data/model/cws/sighan2005_pku_bert_base_96.70'
tokenizer.fit(
    SIGHAN2005_PKU_TRAIN_ALL,
    SIGHAN2005_PKU_TEST,  # Conventionally, no devset is used. See Tian et al. (2020).
    save_dir,
    'bert-base-chinese',
    max_seq_len=300,
    char_level=True,
    hard_constraint=True,
    sampler_builder=SortingSamplerBuilder(batch_size=32),
    epochs=3,
    adam_epsilon=1e-6,
    warmup_steps=0.1,
    weight_decay=0.01,
    word_dropout=0.1,
    seed=1609836303,
)
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)

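Because the random seed is fixed, the result is guaranteed to be 96.70. Unlike some overclaiming papers and projects, HanLP promises that every digit of its scores is reproducible. Any reproducibility issue is treated as a fatal bug and fixed with top priority.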
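
Fixing a seed makes every pseudo-random choice, such as the shuffling order of training data, repeat exactly. The stdlib-only sketch below illustrates the principle with the seed value from the snippet above. It is not HanLP's internal implementation, and a real training run also has to seed the numerical libraries it relies on.

```python
import random

SEED = 1609836303  # same value as in the training snippet


def seeded_shuffle(items, seed=SEED):
    """Return a shuffled copy whose order depends only on the seed."""
    rng = random.Random(seed)  # private generator, leaves global state alone
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


first = seeded_shuffle(range(10))
second = seeded_shuffle(range(10))
print(first == second)  # the two runs produce the same order
```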

Performance

lang	corpora	model	tok (fine)	tok (coarse)	pos (ctb)	pos (pku)	pos (863)	pos (ud)	ner (pku)	ner (msra)	ner (ontonotes)	dep	con	srl	sdp (SemEval16)	sdp (DM)	sdp (PAS)	sdp (PSD)	lem	fea	amr
mul	UD2.7 OntoNotes5	small	98.62	-	-	-	-	93.23	-	-	74.42	79.10	76.85	70.63	-	91.19	93.67	85.34	87.71	84.51	-
mul	UD2.7 OntoNotes5	base	99.67	-	-	-	-	96.51	-	-	80.76	87.64	80.58	77.22	-	94.38	96.10	86.64	94.37	91.60	-
zh	open	small	97.25	-	96.66	-	-	-	-	-	95.00	84.57	87.62	73.40	84.57	-	-	-	-	-	-
zh	open	base	97.50	-	97.07	-	-	-	-	-	96.04	87.11	89.84	77.78	87.11	-	-	-	-	-	-
zh	close	small	96.70	95.93	96.87	97.56	95.05	-	96.22	95.74	76.79	84.44	88.13	75.81	74.28	-	-	-	-	-	-
zh	close	base	97.52	96.44	96.99	97.59	95.29	-	96.48	95.72	77.77	85.29	88.57	76.52	73.76	-	-	-	-	-	-

The AMR models will be released once the paper is accepted.

Citing

If you use HanLP in your research, please cite this repository.

@inproceedings{he-choi-2021-stem,
    title = "The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders",
    author = "He, Han and Choi, Jinho D.",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.451",
    pages = "5555--5577",
    abstract = "Multi-task learning with transformer encoders (MTL) has emerged as a powerful technique to improve performance on closely-related tasks for both accuracy and efficiency while a question still remains whether or not it would perform as well on tasks that are distinct in nature. We first present MTL results on five NLP tasks, POS, NER, DEP, CON, and SRL, and depict its deficiency over single-task learning. We then conduct an extensive pruning analysis to show that a certain set of attention heads get claimed by most tasks during MTL, who interfere with one another to fine-tune those heads for their own objectives. Based on this finding, we propose the Stem Cell Hypothesis to reveal the existence of attention heads naturally talented for many tasks that cannot be jointly trained to create adequate embeddings for all of those tasks. Finally, we design novel parameter-free probes to justify our hypothesis and demonstrate how attention heads are transformed across the five tasks during MTL through label analysis.",
}

License

Codes

HanLP is licensed under the Apache License 2.0. You may use HanLP in your commercial products free of charge. We would appreciate it if you added a link to HanLP on your website.

Models

Unless otherwise specified, all models in HanLP are licensed under CC BY-NC-SA 4.0.

References

https://hanlp.hankcs.com/docs/references.html
