This commit is contained in:
Iliyan Angelov
2025-12-01 06:50:10 +02:00
parent 91f51bc6fe
commit 62c1fe5951
4682 changed files with 544807 additions and 31208 deletions


@@ -0,0 +1,205 @@
# Natural Language Toolkit: Chunkers
#
# Copyright (C) 2001-2025 NLTK Project
# Author: Steven Bird <stevenbird1@gmail.com>
# Edward Loper <edloper@gmail.com>
# URL: <https://www.nltk.org/>
# For license information, see LICENSE.TXT
#
"""
Classes and interfaces for identifying non-overlapping linguistic
groups (such as base noun phrases) in unrestricted text. This task is
called "chunk parsing" or "chunking", and the identified groups are
called "chunks". The chunked text is represented using a shallow
tree called a "chunk structure." A chunk structure is a tree
containing tokens and chunks, where each chunk is a subtree containing
only tokens. For example, the chunk structure for base noun phrase
chunks in the sentence "I saw the big dog on the hill" is::

    (SENTENCE:
        (NP: <I>)
        <saw>
        (NP: <the> <big> <dog>)
        <on>
        (NP: <the> <hill>))

To convert a chunk structure back to a list of tokens, simply use the
chunk structure's ``leaves()`` method.
This module defines ``ChunkParserI``, a standard interface for
chunking texts; and ``RegexpChunkParser``, a regular-expression based
implementation of that interface. It also defines ``ChunkScore``, a
utility class for scoring chunk parsers.
RegexpChunkParser
=================
``RegexpChunkParser`` is an implementation of the chunk parser interface
that uses regular-expressions over tags to chunk a text. Its
``parse()`` method first constructs a ``ChunkString``, which encodes a
particular chunking of the input text. Initially, nothing is
chunked. ``RegexpChunkParser.parse()`` then applies a sequence of
``RegexpChunkRule`` rules to the ``ChunkString``, each of which modifies
the chunking that it encodes. Finally, the ``ChunkString`` is
transformed back into a chunk structure, which is returned.
``RegexpChunkParser`` can only be used to chunk a single kind of phrase.
For example, you can use a ``RegexpChunkParser`` to chunk the noun
phrases in a text, or the verb phrases in a text; but you cannot
use it to simultaneously chunk both noun phrases and verb phrases in
the same text. (This is a limitation of ``RegexpChunkParser``, not of
chunk parsers in general.)
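
For example, the following minimal sketch (using names defined in
``nltk.chunk.regexp``; the rule itself is illustrative only) chunks
sequences of determiners, adjectives and nouns into NP chunks:

>>> from nltk.chunk.regexp import ChunkRule, RegexpChunkParser
>>> rule = ChunkRule("<DT>?<JJ>*<NN.*>+", "Chunk sequences of DT, JJ and NN")
>>> parser = RegexpChunkParser([rule], chunk_label="NP")
>>> print(parser.parse([("the", "DT"), ("big", "JJ"), ("dog", "NN"), ("barked", "VBD")]))
(S (NP the/DT big/JJ dog/NN) barked/VBD)
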
RegexpChunkRules
----------------
A ``RegexpChunkRule`` is a transformational rule that updates the
chunking of a text by modifying its ``ChunkString``. Each
``RegexpChunkRule`` defines the ``apply()`` method, which modifies
the chunking encoded by a ``ChunkString``. The
``RegexpChunkRule`` class itself can be used to implement any
transformational rule based on regular expressions. There are
also a number of subclasses, which can be used to implement
simpler types of rules:
- ``ChunkRule`` chunks anything that matches a given regular
expression.
- ``StripRule`` strips anything that matches a given regular
expression.
- ``UnChunkRule`` will un-chunk any chunk that matches a given
regular expression.
- ``MergeRule`` can be used to merge two contiguous chunks.
- ``SplitRule`` can be used to split a single chunk into two
smaller chunks.
- ``ExpandLeftRule`` will expand a chunk to incorporate new
unchunked material on the left.
- ``ExpandRightRule`` will expand a chunk to incorporate new
unchunked material on the right.
Tag Patterns
~~~~~~~~~~~~
A ``RegexpChunkRule`` uses a modified version of regular
expression patterns, called "tag patterns". Tag patterns are
used to match sequences of tags. Examples of tag patterns are::

    r'(<DT>|<JJ>|<NN>)+'
    r'<NN>+'
    r'<NN.*>'

The differences between regular expression patterns and tag
patterns are:
- In tag patterns, ``'<'`` and ``'>'`` act as parentheses; so
``'<NN>+'`` matches one or more repetitions of ``'<NN>'``, not
``'<NN'`` followed by one or more repetitions of ``'>'``.
- Whitespace in tag patterns is ignored. So
``'<DT> | <NN>'`` is equivalent to ``'<DT>|<NN>'``.
- In tag patterns, ``'.'`` is equivalent to ``'[^{}<>]'``; so
``'<NN.*>'`` matches any single tag starting with ``'NN'``.
The function ``tag_pattern2re_pattern`` can be used to transform
a tag pattern to an equivalent regular expression pattern.
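
In a grammar string passed to ``RegexpParser``, tag patterns appear inside
the braces that mark a chunk rule. A minimal sketch (the grammar below is
illustrative only):

>>> from nltk.chunk import RegexpParser
>>> cp = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
>>> print(cp.parse([("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD")]))
(S (NP the/DT little/JJ cat/NN) sat/VBD)

``tag_pattern2re_pattern`` can also be called directly; here its result is
simply bound to a name rather than echoed:

>>> from nltk.chunk.regexp import tag_pattern2re_pattern
>>> pattern = tag_pattern2re_pattern("<NN.*>+")
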
Efficiency
----------
Preliminary tests indicate that ``RegexpChunkParser`` can chunk at a
rate of about 300 tokens/second, with a moderately complex rule set.
There may be problems if ``RegexpChunkParser`` is used with more than
5,000 tokens at a time. In particular, evaluation of some regular
expressions may cause the Python regular expression engine to
exceed its maximum recursion depth. We have attempted to minimize
these problems, but it is impossible to avoid them completely. We
therefore recommend that you apply the chunk parser to a single
sentence at a time.
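
For example, a tagged corpus would typically be chunked one sentence at a
time (a sketch; ``tagged_sentences`` stands for any list of tagged
sentences and is not defined in this module):

>>> from nltk.chunk import RegexpParser
>>> cp = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
>>> chunked = [cp.parse(sent) for sent in tagged_sentences]  # doctest: +SKIP
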
Emacs Tip
---------
If you evaluate the following elisp expression in emacs, it will
colorize a ``ChunkString`` when you use an interactive python shell
with emacs or xemacs ("C-c !")::

    (let ()
      (defconst comint-mode-font-lock-keywords
        '(("<[^>]+>" 0 'font-lock-reference-face)
          ("[{}]" 0 'font-lock-function-name-face)))
      (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))

You can evaluate this code by copying it to a temporary buffer,
placing the cursor after the last close parenthesis, and typing
"``C-x C-e``". You should evaluate it before running the interactive
session. The change will last until you close emacs.
Unresolved Issues
-----------------
If we use the ``re`` module for regular expressions, Python's
regular expression engine generates "maximum recursion depth
exceeded" errors when processing very large texts, even for
regular expressions that should not require any recursion. We
therefore use the ``pre`` module instead. But note that ``pre``
does not include Unicode support, so this module will not work
with unicode strings. Note also that ``pre`` regular expressions
are not quite as advanced as ``re`` ones (e.g., no leftward
zero-length assertions).
:type CHUNK_TAG_PATTERN: regexp
:var CHUNK_TAG_PATTERN: A regular expression to test whether a tag
pattern is valid.
"""
from nltk.chunk.api import ChunkParserI
from nltk.chunk.named_entity import Maxent_NE_Chunker
from nltk.chunk.regexp import RegexpChunkParser, RegexpParser
from nltk.chunk.util import (
ChunkScore,
accuracy,
conllstr2tree,
conlltags2tree,
ieerstr2tree,
tagstr2tree,
tree2conllstr,
tree2conlltags,
)
def ne_chunker(fmt="multiclass"):
"""
Load NLTK's currently recommended named entity chunker.
"""
return Maxent_NE_Chunker(fmt)
def ne_chunk(tagged_tokens, binary=False):
"""
Use NLTK's currently recommended named entity chunker to
chunk the given list of tagged tokens.
>>> from nltk.chunk import ne_chunk
>>> from nltk.corpus import treebank
>>> from pprint import pprint
>>> pprint(ne_chunk(treebank.tagged_sents()[2][8:14])) # doctest: +NORMALIZE_WHITESPACE
Tree('S', [('chairman', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP')]), ('PLC', 'NNP')])
"""
if binary:
chunker = ne_chunker(fmt="binary")
else:
chunker = ne_chunker()
return chunker.parse(tagged_tokens)
def ne_chunk_sents(tagged_sentences, binary=False):
"""
Use NLTK's currently recommended named entity chunker to chunk the
given list of tagged sentences, each consisting of a list of tagged tokens.
"""
if binary:
chunker = ne_chunker(fmt="binary")
else:
chunker = ne_chunker()
return chunker.parse_sents(tagged_sentences)


@@ -0,0 +1,56 @@
# Natural Language Toolkit: Chunk parsing API
#
# Copyright (C) 2001-2025 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
# Steven Bird <stevenbird1@gmail.com> (minor additions)
# URL: <https://www.nltk.org/>
# For license information, see LICENSE.TXT
##//////////////////////////////////////////////////////
## Chunk Parser Interface
##//////////////////////////////////////////////////////
from nltk.chunk.util import ChunkScore
from nltk.internals import deprecated
from nltk.parse import ParserI
class ChunkParserI(ParserI):
"""
A processing interface for identifying non-overlapping groups in
unrestricted text. Typically, chunk parsers are used to find base
syntactic constituents, such as base noun phrases. Unlike
``ParserI``, ``ChunkParserI`` guarantees that the ``parse()`` method
will always generate a parse.
"""
def parse(self, tokens):
"""
Return the best chunk structure for the given tokens,
represented as a tree.
:param tokens: The list of (word, tag) tokens to be chunked.
:type tokens: list(tuple)
:rtype: Tree
"""
raise NotImplementedError()
@deprecated("Use accuracy(gold) instead.")
def evaluate(self, gold):
return self.accuracy(gold)
def accuracy(self, gold):
"""
Score the accuracy of the chunker against the gold standard.
Remove the chunking from the gold standard text, rechunk it using
the chunker, and return a ``ChunkScore`` object
reflecting the performance of this chunk parser.
:type gold: list(Tree)
:param gold: The list of chunked sentences to score the chunker on.
:rtype: ChunkScore
"""
chunkscore = ChunkScore()
for correct in gold:
chunkscore.score(correct, self.parse(correct.leaves()))
return chunkscore
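
# A minimal illustration of the interface (a hypothetical subclass, not part
# of the API): put every noun into its own NP chunk and leave all other
# tokens unchunked.  A real chunker would use a more informative strategy.
#
#     from nltk.tree import Tree
#
#     class EveryNounChunker(ChunkParserI):
#         def parse(self, tokens):
#             return Tree("S", [Tree("NP", [tok]) if tok[1].startswith("NN") else tok
#                               for tok in tokens])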


@@ -0,0 +1,407 @@
# Natural Language Toolkit: Named entity chunker
#
# Copyright (C) 2001-2025 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
# Eric Kafe <kafe.eric@gmail.com> (tab-format models)
# URL: <https://www.nltk.org/>
# For license information, see LICENSE.TXT
"""
Named entity chunker
"""
import os
import re
from xml.etree import ElementTree as ET
from nltk.tag import ClassifierBasedTagger, pos_tag
try:
from nltk.classify import MaxentClassifier
except ImportError:
pass
from nltk.chunk.api import ChunkParserI
from nltk.chunk.util import ChunkScore
from nltk.data import find
from nltk.tokenize import word_tokenize
from nltk.tree import Tree
class NEChunkParserTagger(ClassifierBasedTagger):
"""
The IOB tagger used by the chunk parser.
"""
def __init__(self, train=None, classifier=None):
ClassifierBasedTagger.__init__(
self,
train=train,
classifier_builder=self._classifier_builder,
classifier=classifier,
)
def _classifier_builder(self, train):
return MaxentClassifier.train(
# "megam" cannot be the default algorithm since it requires compiling with ocaml
train,
algorithm="iis",
gaussian_prior_sigma=1,
trace=2,
)
def _english_wordlist(self):
try:
wl = self._en_wordlist
except AttributeError:
from nltk.corpus import words
self._en_wordlist = set(words.words("en-basic"))
wl = self._en_wordlist
return wl
def _feature_detector(self, tokens, index, history):
word = tokens[index][0]
pos = simplify_pos(tokens[index][1])
if index == 0:
prevword = prevprevword = None
prevpos = prevprevpos = None
prevshape = prevtag = prevprevtag = None
elif index == 1:
prevword = tokens[index - 1][0].lower()
prevprevword = None
prevpos = simplify_pos(tokens[index - 1][1])
prevprevpos = None
prevtag = history[index - 1][0]
prevshape = prevprevtag = None
else:
prevword = tokens[index - 1][0].lower()
prevprevword = tokens[index - 2][0].lower()
prevpos = simplify_pos(tokens[index - 1][1])
prevprevpos = simplify_pos(tokens[index - 2][1])
prevtag = history[index - 1]
prevprevtag = history[index - 2]
prevshape = shape(prevword)
if index == len(tokens) - 1:
nextword = nextnextword = None
nextpos = nextnextpos = None
elif index == len(tokens) - 2:
nextword = tokens[index + 1][0].lower()
nextpos = tokens[index + 1][1].lower()
nextnextword = None
nextnextpos = None
else:
nextword = tokens[index + 1][0].lower()
nextpos = tokens[index + 1][1].lower()
nextnextword = tokens[index + 2][0].lower()
nextnextpos = tokens[index + 2][1].lower()
# 89.6
features = {
"bias": True,
"shape": shape(word),
"wordlen": len(word),
"prefix3": word[:3].lower(),
"suffix3": word[-3:].lower(),
"pos": pos,
"word": word,
"en-wordlist": (word in self._english_wordlist()),
"prevtag": prevtag,
"prevpos": prevpos,
"nextpos": nextpos,
"prevword": prevword,
"nextword": nextword,
"word+nextpos": f"{word.lower()}+{nextpos}",
"pos+prevtag": f"{pos}+{prevtag}",
"shape+prevtag": f"{prevshape}+{prevtag}",
}
return features
class NEChunkParser(ChunkParserI):
"""
Expected input: list of pos-tagged words
"""
def __init__(self, train):
self._train(train)
def parse(self, tokens):
"""
Each token should be a pos-tagged word
"""
tagged = self._tagger.tag(tokens)
tree = self._tagged_to_parse(tagged)
return tree
def _train(self, corpus):
# Convert to tagged sequence
corpus = [self._parse_to_tagged(s) for s in corpus]
self._tagger = NEChunkParserTagger(train=corpus)
def _tagged_to_parse(self, tagged_tokens):
"""
Convert a list of tagged tokens to a chunk-parse tree.
"""
sent = Tree("S", [])
for tok, tag in tagged_tokens:
if tag == "O":
sent.append(tok)
elif tag.startswith("B-"):
sent.append(Tree(tag[2:], [tok]))
elif tag.startswith("I-"):
if sent and isinstance(sent[-1], Tree) and sent[-1].label() == tag[2:]:
sent[-1].append(tok)
else:
sent.append(Tree(tag[2:], [tok]))
return sent
@staticmethod
def _parse_to_tagged(sent):
"""
Convert a chunk-parse tree to a list of tagged tokens.
"""
toks = []
for child in sent:
if isinstance(child, Tree):
if len(child) == 0:
print("Warning -- empty chunk in sentence")
continue
toks.append((child[0], f"B-{child.label()}"))
for tok in child[1:]:
toks.append((tok, f"I-{child.label()}"))
else:
toks.append((child, "O"))
return toks
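# Map a word to a coarse orthographic "shape" feature used by the tagger
# above: "number", "punct", "upcase" (titlecase), "downcase", "mixedcase",
# or "other".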
def shape(word):
if re.match(r"[0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+$", word, re.UNICODE):
return "number"
elif re.match(r"\W+$", word, re.UNICODE):
return "punct"
elif re.match(r"\w+$", word, re.UNICODE):
if word.istitle():
return "upcase"
elif word.islower():
return "downcase"
else:
return "mixedcase"
else:
return "other"
def simplify_pos(s):
if s.startswith("V"):
return "V"
else:
return s.split("-")[0]
def postag_tree(tree):
# Part-of-speech tagging.
words = tree.leaves()
tag_iter = (pos for (word, pos) in pos_tag(words))
newtree = Tree("S", [])
for child in tree:
if isinstance(child, Tree):
newtree.append(Tree(child.label(), []))
for subchild in child:
newtree[-1].append((subchild, next(tag_iter)))
else:
newtree.append((child, next(tag_iter)))
return newtree
def load_ace_data(roots, fmt="binary", skip_bnews=True):
for root in roots:
for root, dirs, files in os.walk(root):
if root.endswith("bnews") and skip_bnews:
continue
for f in files:
if f.endswith(".sgm"):
yield from load_ace_file(os.path.join(root, f), fmt)
def load_ace_file(textfile, fmt):
print(f" - {os.path.split(textfile)[1]}")
annfile = textfile + ".tmx.rdc.xml"
# Read the xml file, and get a list of entities
entities = []
with open(annfile) as infile:
xml = ET.parse(infile).getroot()
for entity in xml.findall("document/entity"):
typ = entity.find("entity_type").text
for mention in entity.findall("entity_mention"):
if mention.get("TYPE") != "NAME":
continue # only NEs
s = int(mention.find("head/charseq/start").text)
e = int(mention.find("head/charseq/end").text) + 1
entities.append((s, e, typ))
# Read the text file, and mark the entities.
with open(textfile) as infile:
text = infile.read()
# Strip XML tags, since they don't count towards the indices
text = re.sub("<(?!/?TEXT)[^>]+>", "", text)
# Blank out anything before/after <TEXT>
def subfunc(m):
return " " * (m.end() - m.start() - 6)
text = re.sub(r"[\s\S]*<TEXT>", subfunc, text)
text = re.sub(r"</TEXT>[\s\S]*", "", text)
# Simplify quotes
text = re.sub("``", ' "', text)
text = re.sub("''", '" ', text)
entity_types = {typ for (s, e, typ) in entities}
# Binary distinction (NE or not NE)
if fmt == "binary":
i = 0
toks = Tree("S", [])
for s, e, typ in sorted(entities):
if s < i:
s = i # Overlapping! Deal with this better?
if e <= s:
continue
toks.extend(word_tokenize(text[i:s]))
toks.append(Tree("NE", text[s:e].split()))
i = e
toks.extend(word_tokenize(text[i:]))
yield toks
# Multiclass distinction (NE type)
elif fmt == "multiclass":
i = 0
toks = Tree("S", [])
for s, e, typ in sorted(entities):
if s < i:
s = i # Overlapping! Deal with this better?
if e <= s:
continue
toks.extend(word_tokenize(text[i:s]))
toks.append(Tree(typ, text[s:e].split()))
i = e
toks.extend(word_tokenize(text[i:]))
yield toks
else:
raise ValueError("bad fmt value")
# This probably belongs in a more general-purpose location (as does
# the parse_to_tagged function).
def cmp_chunks(correct, guessed):
correct = NEChunkParser._parse_to_tagged(correct)
guessed = NEChunkParser._parse_to_tagged(guessed)
ellipsis = False
for (w, ct), (w, gt) in zip(correct, guessed):
if ct == gt == "O":
if not ellipsis:
print(f" {ct:15} {gt:15} {w}")
print(" {:15} {:15} {2}".format("...", "...", "..."))
ellipsis = True
else:
ellipsis = False
print(f" {ct:15} {gt:15} {w}")
# ======================================================================================
class Maxent_NE_Chunker(NEChunkParser):
"""
Expected input: list of pos-tagged words
"""
def __init__(self, fmt="multiclass"):
from nltk.data import find
self._fmt = fmt
self._tab_dir = find(f"chunkers/maxent_ne_chunker_tab/english_ace_{fmt}/")
self.load_params()
def load_params(self):
from nltk.classify.maxent import BinaryMaxentFeatureEncoding, load_maxent_params
wgt, mpg, lab, aon = load_maxent_params(self._tab_dir)
mc = MaxentClassifier(
BinaryMaxentFeatureEncoding(lab, mpg, alwayson_features=aon), wgt
)
self._tagger = NEChunkParserTagger(classifier=mc)
def save_params(self):
from nltk.classify.maxent import save_maxent_params
classif = self._tagger._classifier
ecg = classif._encoding
wgt = classif._weights
mpg = ecg._mapping
lab = ecg._labels
aon = ecg._alwayson
fmt = self._fmt
save_maxent_params(wgt, mpg, lab, aon, tab_dir=f"/tmp/english_ace_{fmt}/")
def build_model(fmt="multiclass"):
chunker = Maxent_NE_Chunker(fmt)
chunker.save_params()
return chunker
# ======================================================================================
"""
2024 update: pickles are not supported anymore.
Deprecated:
def build_model(fmt="binary"):
print("Loading training data...")
train_paths = [
find("corpora/ace_data/ace.dev"),
find("corpora/ace_data/ace.heldout"),
find("corpora/ace_data/bbn.dev"),
find("corpora/ace_data/muc.dev"),
]
train_trees = load_ace_data(train_paths, fmt)
train_data = [postag_tree(t) for t in train_trees]
print("Training...")
cp = NEChunkParser(train_data)
del train_data
print("Loading eval data...")
eval_paths = [find("corpora/ace_data/ace.eval")]
eval_trees = load_ace_data(eval_paths, fmt)
eval_data = [postag_tree(t) for t in eval_trees]
print("Evaluating...")
chunkscore = ChunkScore()
for i, correct in enumerate(eval_data):
guess = cp.parse(correct.leaves())
chunkscore.score(correct, guess)
if i < 3:
cmp_chunks(correct, guess)
print(chunkscore)
outfilename = f"/tmp/ne_chunker_{fmt}.pickle"
print(f"Saving chunker to {outfilename}...")
with open(outfilename, "wb") as outfile:
pickle.dump(cp, outfile, -1)
return cp
"""
if __name__ == "__main__":
# Make sure that the object has the right class name:
build_model("binary")
build_model("multiclass")

File diff suppressed because it is too large


@@ -0,0 +1,642 @@
# Natural Language Toolkit: Chunk format conversions
#
# Copyright (C) 2001-2025 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
# Steven Bird <stevenbird1@gmail.com> (minor additions)
# URL: <https://www.nltk.org/>
# For license information, see LICENSE.TXT
import re
from nltk.metrics import accuracy as _accuracy
from nltk.tag.mapping import map_tag
from nltk.tag.util import str2tuple
from nltk.tree import Tree
##//////////////////////////////////////////////////////
## EVALUATION
##//////////////////////////////////////////////////////
def accuracy(chunker, gold):
"""
Score the accuracy of the chunker against the gold standard.
Strip the chunk information from the gold standard and rechunk it using
the chunker, then compute the accuracy score.
:type chunker: ChunkParserI
:param chunker: The chunker being evaluated.
:type gold: list(Tree)
:param gold: The chunk structures to score the chunker on.
:rtype: float
"""
gold_tags = []
test_tags = []
for gold_tree in gold:
test_tree = chunker.parse(gold_tree.flatten())
gold_tags += tree2conlltags(gold_tree)
test_tags += tree2conlltags(test_tree)
# print 'GOLD:', gold_tags[:50]
# print 'TEST:', test_tags[:50]
return _accuracy(gold_tags, test_tags)
# Patched for increased performance by Yoav Goldberg <yoavg@cs.bgu.ac.il>, 2006-01-13
# -- statistics are evaluated only on demand, instead of at every sentence evaluation
#
# SB: use nltk.metrics for precision/recall scoring?
#
class ChunkScore:
"""
A utility class for scoring chunk parsers. ``ChunkScore`` can
evaluate a chunk parser's output, based on a number of statistics
(precision, recall, f-measure, missed chunks, incorrect chunks).
It can also combine the scores from the parsing of multiple texts;
this makes it significantly easier to evaluate a chunk parser that
operates one sentence at a time.
Texts are evaluated with the ``score`` method. The results of
evaluation can be accessed via a number of accessor methods, such
as ``precision`` and ``f_measure``. A typical use of the
``ChunkScore`` class is::
>>> chunkscore = ChunkScore() # doctest: +SKIP
>>> for correct in correct_sentences: # doctest: +SKIP
... guess = chunkparser.parse(correct.leaves()) # doctest: +SKIP
... chunkscore.score(correct, guess) # doctest: +SKIP
>>> print('F Measure:', chunkscore.f_measure()) # doctest: +SKIP
F Measure: 0.823
:ivar kwargs: Keyword arguments:
- max_tp_examples: The maximum number of actual examples of true
positives to record. This affects the ``correct`` member
function: ``correct`` will not return more than this number
of true positive examples. This does *not* affect any of
the numerical metrics (precision, recall, or f-measure)
- max_fp_examples: The maximum number of actual examples of false
positives to record. This affects the ``incorrect`` member
function and the ``guessed`` member function: ``incorrect``
will not return more than this number of examples, and
``guessed`` will not return more than this number of true
positive examples. This does *not* affect any of the
numerical metrics (precision, recall, or f-measure)
- max_fn_examples: The maximum number of actual examples of false
negatives to record. This affects the ``missed`` member
function and the ``correct`` member function: ``missed``
will not return more than this number of examples, and
``correct`` will not return more than this number of true
negative examples. This does *not* affect any of the
numerical metrics (precision, recall, or f-measure)
- chunk_label: A regular expression indicating which chunks
should be compared. Defaults to ``'.*'`` (i.e., all chunks).
:type _tp: list(Token)
:ivar _tp: List of true positives
:type _fp: list(Token)
:ivar _fp: List of false positives
:type _fn: list(Token)
:ivar _fn: List of false negatives
:type _tp_num: int
:ivar _tp_num: Number of true positives
:type _fp_num: int
:ivar _fp_num: Number of false positives
:type _fn_num: int
:ivar _fn_num: Number of false negatives.
"""
def __init__(self, **kwargs):
self._correct = set()
self._guessed = set()
self._tp = set()
self._fp = set()
self._fn = set()
self._max_tp = kwargs.get("max_tp_examples", 100)
self._max_fp = kwargs.get("max_fp_examples", 100)
self._max_fn = kwargs.get("max_fn_examples", 100)
self._chunk_label = kwargs.get("chunk_label", ".*")
self._tp_num = 0
self._fp_num = 0
self._fn_num = 0
self._count = 0
self._tags_correct = 0.0
self._tags_total = 0.0
self._measuresNeedUpdate = False
def _updateMeasures(self):
if self._measuresNeedUpdate:
self._tp = self._guessed & self._correct
self._fn = self._correct - self._guessed
self._fp = self._guessed - self._correct
self._tp_num = len(self._tp)
self._fp_num = len(self._fp)
self._fn_num = len(self._fn)
self._measuresNeedUpdate = False
def score(self, correct, guessed):
"""
Given a correctly chunked sentence, score another chunked
version of the same sentence.
:type correct: chunk structure
:param correct: The known-correct ("gold standard") chunked
sentence.
:type guessed: chunk structure
:param guessed: The chunked sentence to be scored.
"""
self._correct |= _chunksets(correct, self._count, self._chunk_label)
self._guessed |= _chunksets(guessed, self._count, self._chunk_label)
self._count += 1
self._measuresNeedUpdate = True
# Keep track of per-tag accuracy (if possible)
try:
correct_tags = tree2conlltags(correct)
guessed_tags = tree2conlltags(guessed)
except ValueError:
# This exception case is for nested chunk structures,
# where tree2conlltags will fail with a ValueError: "Tree
# is too deeply nested to be printed in CoNLL format."
correct_tags = guessed_tags = ()
self._tags_total += len(correct_tags)
self._tags_correct += sum(
1 for (t, g) in zip(guessed_tags, correct_tags) if t == g
)
def accuracy(self):
"""
Return the overall tag-based accuracy for all texts that have
been scored by this ``ChunkScore``, using the IOB (conll2000)
tag encoding.
:rtype: float
"""
if self._tags_total == 0:
return 1
return self._tags_correct / self._tags_total
def precision(self):
"""
Return the overall precision for all texts that have been
scored by this ``ChunkScore``.
:rtype: float
"""
self._updateMeasures()
div = self._tp_num + self._fp_num
if div == 0:
return 0
else:
return self._tp_num / div
def recall(self):
"""
Return the overall recall for all texts that have been
scored by this ``ChunkScore``.
:rtype: float
"""
self._updateMeasures()
div = self._tp_num + self._fn_num
if div == 0:
return 0
else:
return self._tp_num / div
def f_measure(self, alpha=0.5):
"""
Return the overall F measure for all texts that have been
scored by this ``ChunkScore``.
:param alpha: the relative weighting of precision and recall.
Larger alpha biases the score towards the precision value,
while smaller alpha biases the score towards the recall
value. ``alpha`` should have a value in the range [0,1].
:type alpha: float
:rtype: float
"""
self._updateMeasures()
p = self.precision()
r = self.recall()
if p == 0 or r == 0: # what if alpha is 0 or 1?
return 0
return 1 / (alpha / p + (1 - alpha) / r)
def missed(self):
"""
Return the chunks which were included in the
correct chunk structures, but not in the guessed chunk
structures, listed in input order.
:rtype: list of chunks
"""
self._updateMeasures()
chunks = list(self._fn)
return [c[1] for c in chunks] # discard position information
def incorrect(self):
"""
Return the chunks which were included in the guessed chunk structures,
but not in the correct chunk structures, listed in input order.
:rtype: list of chunks
"""
self._updateMeasures()
chunks = list(self._fp)
return [c[1] for c in chunks] # discard position information
def correct(self):
"""
Return the chunks which were included in the correct
chunk structures, listed in input order.
:rtype: list of chunks
"""
chunks = list(self._correct)
return [c[1] for c in chunks] # discard position information
def guessed(self):
"""
Return the chunks which were included in the guessed
chunk structures, listed in input order.
:rtype: list of chunks
"""
chunks = list(self._guessed)
return [c[1] for c in chunks] # discard position information
def __len__(self):
self._updateMeasures()
return self._tp_num + self._fn_num
def __repr__(self):
"""
Return a concise representation of this ``ChunkScore``.
:rtype: str
"""
return "<ChunkScoring of " + repr(len(self)) + " chunks>"
def __str__(self):
"""
Return a verbose representation of this ``ChunkScore``.
This representation includes the precision, recall, and
f-measure scores. For other information about the score,
use the accessor methods (e.g., ``missed()`` and ``incorrect()``).
:rtype: str
"""
return (
"ChunkParse score:\n"
+ f" IOB Accuracy: {self.accuracy() * 100:5.1f}%\n"
+ f" Precision: {self.precision() * 100:5.1f}%\n"
+ f" Recall: {self.recall() * 100:5.1f}%\n"
+ f" F-Measure: {self.f_measure() * 100:5.1f}%"
)
# extract chunks, and assign unique id, the absolute position of
# the first word of the chunk
def _chunksets(t, count, chunk_label):
pos = 0
chunks = []
for child in t:
if isinstance(child, Tree):
if re.match(chunk_label, child.label()):
chunks.append(((count, pos), child.freeze()))
pos += len(child.leaves())
else:
pos += 1
return set(chunks)
def tagstr2tree(
s, chunk_label="NP", root_label="S", sep="/", source_tagset=None, target_tagset=None
):
"""
Divide a string of bracketed tagged text into
chunks and unchunked tokens, and produce a Tree.
Chunks are marked by square brackets (``[...]``). Words are
delimited by whitespace, and each word should have the form
``text/tag``. Words that do not contain a slash are
assigned a ``tag`` of None.
:param s: The string to be converted
:type s: str
:param chunk_label: The label to use for chunk nodes
:type chunk_label: str
:param root_label: The label to use for the root of the tree
:type root_label: str
:rtype: Tree
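
For example, a short bracketed string converts as follows:

>>> from nltk.chunk.util import tagstr2tree
>>> print(tagstr2tree("[ the/DT dog/NN ] barked/VBD", chunk_label="NP"))
(S (NP the/DT dog/NN) barked/VBD)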
"""
WORD_OR_BRACKET = re.compile(r"\[|\]|[^\[\]\s]+")
stack = [Tree(root_label, [])]
for match in WORD_OR_BRACKET.finditer(s):
text = match.group()
if text[0] == "[":
if len(stack) != 1:
raise ValueError(f"Unexpected [ at char {match.start():d}")
chunk = Tree(chunk_label, [])
stack[-1].append(chunk)
stack.append(chunk)
elif text[0] == "]":
if len(stack) != 2:
raise ValueError(f"Unexpected ] at char {match.start():d}")
stack.pop()
else:
if sep is None:
stack[-1].append(text)
else:
word, tag = str2tuple(text, sep)
if source_tagset and target_tagset:
tag = map_tag(source_tagset, target_tagset, tag)
stack[-1].append((word, tag))
if len(stack) != 1:
raise ValueError(f"Expected ] at char {len(s):d}")
return stack[0]
### CONLL
_LINE_RE = re.compile(r"(\S+)\s+(\S+)\s+([IOB])-?(\S+)?")
def conllstr2tree(s, chunk_types=("NP", "PP", "VP"), root_label="S"):
"""
Return a chunk structure for a single sentence
encoded in the given CONLL 2000 style string.
This function converts a CoNLL IOB string into a tree.
It uses the specified chunk types
(defaults to NP, PP and VP), and creates a tree rooted at a node
labeled S (by default).
:param s: The CoNLL string to be converted.
:type s: str
:param chunk_types: The chunk types to be converted.
:type chunk_types: tuple
:param root_label: The node label to use for the root.
:type root_label: str
:rtype: Tree
"""
stack = [Tree(root_label, [])]
for lineno, line in enumerate(s.split("\n")):
if not line.strip():
continue
# Decode the line.
match = _LINE_RE.match(line)
if match is None:
raise ValueError(f"Error on line {lineno:d}")
(word, tag, state, chunk_type) = match.groups()
# If it's a chunk type we don't care about, treat it as O.
if chunk_types is not None and chunk_type not in chunk_types:
state = "O"
# For "Begin"/"Outside", finish any completed chunks -
# also do so for "Inside" which don't match the previous token.
mismatch_I = state == "I" and chunk_type != stack[-1].label()
if state in "BO" or mismatch_I:
if len(stack) == 2:
stack.pop()
# For "Begin", start a new chunk.
if state == "B" or mismatch_I:
chunk = Tree(chunk_type, [])
stack[-1].append(chunk)
stack.append(chunk)
# Add the new word token.
stack[-1].append((word, tag))
return stack[0]
def tree2conlltags(t):
"""
Return a list of 3-tuples containing ``(word, tag, IOB-tag)``.
Convert a tree to the CoNLL IOB tag format.
:param t: The tree to be converted.
:type t: Tree
:rtype: list(tuple)
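
For example, a small two-chunk tree converts as follows:

>>> from nltk.tree import Tree
>>> tree2conlltags(Tree("S", [Tree("NP", [("the", "DT"), ("dog", "NN")]), ("barked", "VBD")]))
[('the', 'DT', 'B-NP'), ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O')]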
"""
tags = []
for child in t:
try:
category = child.label()
prefix = "B-"
for contents in child:
if isinstance(contents, Tree):
raise ValueError(
"Tree is too deeply nested to be printed in CoNLL format"
)
tags.append((contents[0], contents[1], prefix + category))
prefix = "I-"
except AttributeError:
tags.append((child[0], child[1], "O"))
return tags
def conlltags2tree(
sentence, chunk_types=("NP", "PP", "VP"), root_label="S", strict=False
):
"""
Convert the CoNLL IOB format to a tree.
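
For example, a small IOB-tagged sentence converts as follows:

>>> print(conlltags2tree([("the", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")]))
(S (NP the/DT dog/NN) barked/VBD)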
"""
tree = Tree(root_label, [])
for word, postag, chunktag in sentence:
if chunktag is None:
if strict:
raise ValueError("Bad conll tag sequence")
else:
# Treat as O
tree.append((word, postag))
elif chunktag.startswith("B-"):
tree.append(Tree(chunktag[2:], [(word, postag)]))
elif chunktag.startswith("I-"):
if (
len(tree) == 0
or not isinstance(tree[-1], Tree)
or tree[-1].label() != chunktag[2:]
):
if strict:
raise ValueError("Bad conll tag sequence")
else:
# Treat as B-*
tree.append(Tree(chunktag[2:], [(word, postag)]))
else:
tree[-1].append((word, postag))
elif chunktag == "O":
tree.append((word, postag))
else:
raise ValueError(f"Bad conll tag {chunktag!r}")
return tree
def tree2conllstr(t):
"""
Return a multiline string where each line contains a word, tag and IOB tag.
Convert a tree to the CoNLL IOB string format
:param t: The tree to be converted.
:type t: Tree
:rtype: str
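
For example, combined with ``conlltags2tree``:

>>> print(tree2conllstr(conlltags2tree([("the", "DT", "B-NP"), ("dog", "NN", "I-NP")])))
the DT B-NP
dog NN I-NP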
"""
lines = [" ".join(token) for token in tree2conlltags(t)]
return "\n".join(lines)
### IEER
_IEER_DOC_RE = re.compile(
r"<DOC>\s*"
r"(<DOCNO>\s*(?P<docno>.+?)\s*</DOCNO>\s*)?"
r"(<DOCTYPE>\s*(?P<doctype>.+?)\s*</DOCTYPE>\s*)?"
r"(<DATE_TIME>\s*(?P<date_time>.+?)\s*</DATE_TIME>\s*)?"
r"<BODY>\s*"
r"(<HEADLINE>\s*(?P<headline>.+?)\s*</HEADLINE>\s*)?"
r"<TEXT>(?P<text>.*?)</TEXT>\s*"
r"</BODY>\s*</DOC>\s*",
re.DOTALL,
)
_IEER_TYPE_RE = re.compile(r'<b_\w+\s+[^>]*?type="(?P<type>\w+)"')
def _ieer_read_text(s, root_label):
stack = [Tree(root_label, [])]
# s will be None if there is no headline in the text
# return the empty list in place of a Tree
if s is None:
return []
for piece_m in re.finditer(r"<[^>]+>|[^\s<]+", s):
piece = piece_m.group()
try:
if piece.startswith("<b_"):
m = _IEER_TYPE_RE.match(piece)
if m is None:
print("XXXX", piece)
chunk = Tree(m.group("type"), [])
stack[-1].append(chunk)
stack.append(chunk)
elif piece.startswith("<e_"):
stack.pop()
# elif piece.startswith('<'):
# print "ERROR:", piece
# raise ValueError # Unexpected HTML
else:
stack[-1].append(piece)
except (IndexError, ValueError) as e:
raise ValueError(
f"Bad IEER string (error at character {piece_m.start():d})"
) from e
if len(stack) != 1:
raise ValueError("Bad IEER string")
return stack[0]
def ieerstr2tree(
s,
chunk_types=[
"LOCATION",
"ORGANIZATION",
"PERSON",
"DURATION",
"DATE",
"CARDINAL",
"PERCENT",
"MONEY",
"MEASURE",
],
root_label="S",
):
"""
Return a chunk structure containing the chunked tagged text that is
encoded in the given IEER style string.
Convert a string of chunked tagged text in the IEER named
entity format into a chunk structure. Chunks are of several
types, LOCATION, ORGANIZATION, PERSON, DURATION, DATE, CARDINAL,
PERCENT, MONEY, and MEASURE.
:rtype: Tree
"""
# Try looking for a single document. If that doesn't work, then just
# treat everything as if it was within the <TEXT>...</TEXT>.
m = _IEER_DOC_RE.match(s)
if m:
return {
"text": _ieer_read_text(m.group("text"), root_label),
"docno": m.group("docno"),
"doctype": m.group("doctype"),
"date_time": m.group("date_time"),
#'headline': m.group('headline')
# we want to capture NEs in the headline too!
"headline": _ieer_read_text(m.group("headline"), root_label),
}
else:
return _ieer_read_text(s, root_label)
def demo():
s = "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] ./."
import nltk
t = nltk.chunk.tagstr2tree(s, chunk_label="NP")
t.pprint()
print()
s = """
These DT B-NP
research NN I-NP
protocols NNS I-NP
offer VBP B-VP
to TO B-PP
the DT B-NP
patient NN I-NP
not RB O
only RB O
the DT B-NP
very RB I-NP
best JJS I-NP
therapy NN I-NP
which WDT B-NP
we PRP B-NP
have VBP B-VP
established VBN I-VP
today NN B-NP
but CC B-NP
also RB I-NP
the DT B-NP
hope NN I-NP
of IN B-PP
something NN B-NP
still RB B-ADJP
better JJR I-ADJP
. . O
"""
conll_tree = conllstr2tree(s, chunk_types=("NP", "PP"))
conll_tree.pprint()
# Demonstrate CoNLL output
print("CoNLL output:")
print(nltk.chunk.tree2conllstr(conll_tree))
print()
if __name__ == "__main__":
demo()