updates

205  Backend/venv/lib/python3.12/site-packages/nltk/chunk/__init__.py  Normal file
@@ -0,0 +1,205 @@
# Natural Language Toolkit: Chunkers
#
# Copyright (C) 2001-2025 NLTK Project
# Author: Steven Bird <stevenbird1@gmail.com>
#         Edward Loper <edloper@gmail.com>
# URL: <https://www.nltk.org/>
# For license information, see LICENSE.TXT
#

"""
Classes and interfaces for identifying non-overlapping linguistic
groups (such as base noun phrases) in unrestricted text. This task is
called "chunk parsing" or "chunking", and the identified groups are
called "chunks". The chunked text is represented using a shallow
tree called a "chunk structure." A chunk structure is a tree
containing tokens and chunks, where each chunk is a subtree containing
only tokens. For example, the chunk structure for base noun phrase
chunks in the sentence "I saw the big dog on the hill" is::

  (SENTENCE:
    (NP: <I>)
    <saw>
    (NP: <the> <big> <dog>)
    <on>
    (NP: <the> <hill>))

To convert a chunk structure back to a list of tokens, simply use the
chunk structure's ``leaves()`` method.

This module defines ``ChunkParserI``, a standard interface for
chunking texts; and ``RegexpChunkParser``, a regular-expression based
implementation of that interface. It also defines ``ChunkScore``, a
utility class for scoring chunk parsers.

RegexpChunkParser
=================

``RegexpChunkParser`` is an implementation of the chunk parser interface
that uses regular expressions over tags to chunk a text. Its
``parse()`` method first constructs a ``ChunkString``, which encodes a
particular chunking of the input text. Initially, nothing is
chunked. ``RegexpChunkParser.parse()`` then applies a sequence of
``RegexpChunkRule`` rules to the ``ChunkString``, each of which modifies
the chunking that it encodes. Finally, the ``ChunkString`` is
transformed back into a chunk structure, which is returned.

``RegexpChunkParser`` can only be used to chunk a single kind of phrase.
For example, you can use a ``RegexpChunkParser`` to chunk the noun
phrases in a text, or the verb phrases in a text; but you cannot
use it to simultaneously chunk both noun phrases and verb phrases in
the same text. (This is a limitation of ``RegexpChunkParser``, not of
chunk parsers in general.)

RegexpChunkRules
----------------

A ``RegexpChunkRule`` is a transformational rule that updates the
chunking of a text by modifying its ``ChunkString``. Each
``RegexpChunkRule`` defines the ``apply()`` method, which modifies
the chunking encoded by a ``ChunkString``. The
``RegexpChunkRule`` class itself can be used to implement any
transformational rule based on regular expressions. There are
also a number of subclasses, which can be used to implement
simpler types of rules:

- ``ChunkRule`` chunks anything that matches a given regular
  expression.
- ``StripRule`` strips anything that matches a given regular
  expression.
- ``UnChunkRule`` will un-chunk any chunk that matches a given
  regular expression.
- ``MergeRule`` can be used to merge two contiguous chunks.
- ``SplitRule`` can be used to split a single chunk into two
  smaller chunks.
- ``ExpandLeftRule`` will expand a chunk to incorporate new
  unchunked material on the left.
- ``ExpandRightRule`` will expand a chunk to incorporate new
  unchunked material on the right.

Tag Patterns
~~~~~~~~~~~~

A ``RegexpChunkRule`` uses a modified version of regular
expression patterns, called "tag patterns". Tag patterns are
used to match sequences of tags. Examples of tag patterns are::

    r'(<DT>|<JJ>|<NN>)+'
    r'<NN>+'
    r'<NN.*>'

The differences between regular expression patterns and tag
patterns are:

- In tag patterns, ``'<'`` and ``'>'`` act as parentheses; so
  ``'<NN>+'`` matches one or more repetitions of ``'<NN>'``, not
  ``'<NN'`` followed by one or more repetitions of ``'>'``.
- Whitespace in tag patterns is ignored. So
  ``'<DT> | <NN>'`` is equivalent to ``'<DT>|<NN>'``.
- In tag patterns, ``'.'`` is equivalent to ``'[^{}<>]'``; so
  ``'<NN.*>'`` matches any single tag starting with ``'NN'``.
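
For illustration, tag patterns are the building blocks of the grammars
accepted by ``RegexpParser``. A minimal sketch (the grammar and sentence
below are illustrative, not part of this module's API)::

    >>> from nltk.chunk import RegexpParser
    >>> sent = [("the", "DT"), ("big", "JJ"), ("dog", "NN"), ("barked", "VBD")]
    >>> chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
    >>> print(chunker.parse(sent))
    (S (NP the/DT big/JJ dog/NN) barked/VBD)
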
The function ``tag_pattern2re_pattern`` can be used to transform
a tag pattern to an equivalent regular expression pattern.

Efficiency
----------

Preliminary tests indicate that ``RegexpChunkParser`` can chunk at a
rate of about 300 tokens/second, with a moderately complex rule set.

There may be problems if ``RegexpChunkParser`` is used with more than
5,000 tokens at a time. In particular, evaluation of some regular
expressions may cause the Python regular expression engine to
exceed its maximum recursion depth. We have attempted to minimize
these problems, but it is impossible to avoid them completely. We
therefore recommend that you apply the chunk parser to a single
sentence at a time.

Emacs Tip
---------

If you evaluate the following elisp expression in emacs, it will
colorize a ``ChunkString`` when you use an interactive python shell
with emacs or xemacs ("C-c !")::

    (let ()
      (defconst comint-mode-font-lock-keywords
        '(("<[^>]+>" 0 'font-lock-reference-face)
          ("[{}]" 0 'font-lock-function-name-face)))
      (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))

You can evaluate this code by copying it to a temporary buffer,
placing the cursor after the last close parenthesis, and typing
"``C-x C-e``". You should evaluate it before running the interactive
session. The change will last until you close emacs.

Unresolved Issues
-----------------

If we use the ``re`` module for regular expressions, Python's
regular expression engine generates "maximum recursion depth
exceeded" errors when processing very large texts, even for
regular expressions that should not require any recursion. We
therefore use the ``pre`` module instead. But note that ``pre``
does not include Unicode support, so this module will not work
with unicode strings. Note also that ``pre`` regular expressions
are not quite as advanced as ``re`` ones (e.g., no leftward
zero-length assertions).

:type CHUNK_TAG_PATTERN: regexp
:var CHUNK_TAG_PATTERN: A regular expression to test whether a tag
    pattern is valid.
"""

from nltk.chunk.api import ChunkParserI
from nltk.chunk.named_entity import Maxent_NE_Chunker
from nltk.chunk.regexp import RegexpChunkParser, RegexpParser
from nltk.chunk.util import (
    ChunkScore,
    accuracy,
    conllstr2tree,
    conlltags2tree,
    ieerstr2tree,
    tagstr2tree,
    tree2conllstr,
    tree2conlltags,
)


def ne_chunker(fmt="multiclass"):
    """
    Load NLTK's currently recommended named entity chunker.
    """
    return Maxent_NE_Chunker(fmt)


def ne_chunk(tagged_tokens, binary=False):
    """
    Use NLTK's currently recommended named entity chunker to
    chunk the given list of tagged tokens.

    >>> from nltk.chunk import ne_chunk
    >>> from nltk.corpus import treebank
    >>> from pprint import pprint
    >>> pprint(ne_chunk(treebank.tagged_sents()[2][8:14])) # doctest: +NORMALIZE_WHITESPACE
    Tree('S', [('chairman', 'NN'), ('of', 'IN'), Tree('ORGANIZATION', [('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP')]), ('PLC', 'NNP')])

    """
    if binary:
        chunker = ne_chunker(fmt="binary")
    else:
        chunker = ne_chunker()
    return chunker.parse(tagged_tokens)


def ne_chunk_sents(tagged_sentences, binary=False):
    """
    Use NLTK's currently recommended named entity chunker to chunk the
    given list of tagged sentences, each consisting of a list of tagged tokens.
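
    A minimal sketch of its use (skipped here because it needs the
    ``maxent_ne_chunker_tab`` and ``words`` data packages; the tags and
    entity labels shown are illustrative)::

        >>> from nltk.chunk import ne_chunk_sents
        >>> sents = [[("Mark", "NNP"), ("works", "VBZ"), ("at", "IN"), ("Google", "NNP")]]
        >>> for tree in ne_chunk_sents(sents):  # doctest: +SKIP
        ...     print(tree)
        (S (PERSON Mark/NNP) works/VBZ at/IN (ORGANIZATION Google/NNP))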
"""
|
||||
if binary:
|
||||
chunker = ne_chunker(fmt="binary")
|
||||
else:
|
||||
chunker = ne_chunker()
|
||||
return chunker.parse_sents(tagged_sentences)
|
||||
Binary files not shown (5 files).

56  Backend/venv/lib/python3.12/site-packages/nltk/chunk/api.py  Normal file
@@ -0,0 +1,56 @@
# Natural Language Toolkit: Chunk parsing API
#
# Copyright (C) 2001-2025 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
#         Steven Bird <stevenbird1@gmail.com> (minor additions)
# URL: <https://www.nltk.org/>
# For license information, see LICENSE.TXT

##//////////////////////////////////////////////////////
## Chunk Parser Interface
##//////////////////////////////////////////////////////

from nltk.chunk.util import ChunkScore
from nltk.internals import deprecated
from nltk.parse import ParserI


class ChunkParserI(ParserI):
    """
    A processing interface for identifying non-overlapping groups in
    unrestricted text. Typically, chunk parsers are used to find base
    syntactic constituents, such as base noun phrases. Unlike
    ``ParserI``, ``ChunkParserI`` guarantees that the ``parse()`` method
    will always generate a parse.
    """

    def parse(self, tokens):
        """
        Return the best chunk structure for the given tokens, as a tree.

        :param tokens: The list of (word, tag) tokens to be chunked.
        :type tokens: list(tuple)
        :rtype: Tree
        """
        raise NotImplementedError()

    @deprecated("Use accuracy(gold) instead.")
    def evaluate(self, gold):
        return self.accuracy(gold)

    def accuracy(self, gold):
        """
        Score the accuracy of the chunker against the gold standard.
        Remove the chunking from the gold standard text, rechunk it using
        the chunker, and return a ``ChunkScore`` object
        reflecting the performance of this chunk parser.

        :type gold: list(Tree)
        :param gold: The list of chunked sentences to score the chunker on.
        :rtype: ChunkScore
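
        A typical use, assuming ``chunker`` is any trained ``ChunkParserI``
        implementation (skipped here since it needs the ``conll2000``
        corpus and a trained chunker)::

            >>> from nltk.corpus import conll2000  # doctest: +SKIP
            >>> score = chunker.accuracy(conll2000.chunked_sents("test.txt")[:50])  # doctest: +SKIP
            >>> print(score)  # doctest: +SKIP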
"""
|
||||
chunkscore = ChunkScore()
|
||||
for correct in gold:
|
||||
chunkscore.score(correct, self.parse(correct.leaves()))
|
||||
return chunkscore
|
||||
407  Backend/venv/lib/python3.12/site-packages/nltk/chunk/named_entity.py  Normal file
@@ -0,0 +1,407 @@
# Natural Language Toolkit: Chunk parsing API
#
# Copyright (C) 2001-2025 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
#         Eric Kafe <kafe.eric@gmail.com> (tab-format models)
# URL: <https://www.nltk.org/>
# For license information, see LICENSE.TXT

"""
Named entity chunker
"""

import os
import re
from xml.etree import ElementTree as ET

from nltk.tag import ClassifierBasedTagger, pos_tag

try:
    from nltk.classify import MaxentClassifier
except ImportError:
    pass

from nltk.chunk.api import ChunkParserI
from nltk.chunk.util import ChunkScore
from nltk.data import find
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

class NEChunkParserTagger(ClassifierBasedTagger):
|
||||
"""
|
||||
The IOB tagger used by the chunk parser.
|
||||
"""
|
||||
|
||||
def __init__(self, train=None, classifier=None):
|
||||
ClassifierBasedTagger.__init__(
|
||||
self,
|
||||
train=train,
|
||||
classifier_builder=self._classifier_builder,
|
||||
classifier=classifier,
|
||||
)
|
||||
|
||||
def _classifier_builder(self, train):
|
||||
return MaxentClassifier.train(
|
||||
# "megam" cannot be the default algorithm since it requires compiling with ocaml
|
||||
train,
|
||||
algorithm="iis",
|
||||
gaussian_prior_sigma=1,
|
||||
trace=2,
|
||||
)
|
||||
|
||||
def _english_wordlist(self):
|
||||
try:
|
||||
wl = self._en_wordlist
|
||||
except AttributeError:
|
||||
from nltk.corpus import words
|
||||
|
||||
self._en_wordlist = set(words.words("en-basic"))
|
||||
wl = self._en_wordlist
|
||||
return wl
|
||||
|
||||
def _feature_detector(self, tokens, index, history):
|
||||
word = tokens[index][0]
|
||||
pos = simplify_pos(tokens[index][1])
|
||||
if index == 0:
|
||||
prevword = prevprevword = None
|
||||
prevpos = prevprevpos = None
|
||||
prevshape = prevtag = prevprevtag = None
|
||||
elif index == 1:
|
||||
prevword = tokens[index - 1][0].lower()
|
||||
prevprevword = None
|
||||
prevpos = simplify_pos(tokens[index - 1][1])
|
||||
prevprevpos = None
|
||||
prevtag = history[index - 1][0]
|
||||
prevshape = prevprevtag = None
|
||||
else:
|
||||
prevword = tokens[index - 1][0].lower()
|
||||
prevprevword = tokens[index - 2][0].lower()
|
||||
prevpos = simplify_pos(tokens[index - 1][1])
|
||||
prevprevpos = simplify_pos(tokens[index - 2][1])
|
||||
prevtag = history[index - 1]
|
||||
prevprevtag = history[index - 2]
|
||||
prevshape = shape(prevword)
|
||||
if index == len(tokens) - 1:
|
||||
nextword = nextnextword = None
|
||||
nextpos = nextnextpos = None
|
||||
elif index == len(tokens) - 2:
|
||||
nextword = tokens[index + 1][0].lower()
|
||||
nextpos = tokens[index + 1][1].lower()
|
||||
nextnextword = None
|
||||
nextnextpos = None
|
||||
else:
|
||||
nextword = tokens[index + 1][0].lower()
|
||||
nextpos = tokens[index + 1][1].lower()
|
||||
nextnextword = tokens[index + 2][0].lower()
|
||||
nextnextpos = tokens[index + 2][1].lower()
|
||||
|
||||
# 89.6
|
||||
features = {
|
||||
"bias": True,
|
||||
"shape": shape(word),
|
||||
"wordlen": len(word),
|
||||
"prefix3": word[:3].lower(),
|
||||
"suffix3": word[-3:].lower(),
|
||||
"pos": pos,
|
||||
"word": word,
|
||||
"en-wordlist": (word in self._english_wordlist()),
|
||||
"prevtag": prevtag,
|
||||
"prevpos": prevpos,
|
||||
"nextpos": nextpos,
|
||||
"prevword": prevword,
|
||||
"nextword": nextword,
|
||||
"word+nextpos": f"{word.lower()}+{nextpos}",
|
||||
"pos+prevtag": f"{pos}+{prevtag}",
|
||||
"shape+prevtag": f"{prevshape}+{prevtag}",
|
||||
}
|
||||
|
||||
return features
|
||||
|
||||
|
||||
class NEChunkParser(ChunkParserI):
|
||||
"""
|
||||
Expected input: list of pos-tagged words
|
||||
"""
|
||||
|
||||
def __init__(self, train):
|
||||
self._train(train)
|
||||
|
||||
def parse(self, tokens):
|
||||
"""
|
||||
Each token should be a pos-tagged word
|
||||
"""
|
||||
tagged = self._tagger.tag(tokens)
|
||||
tree = self._tagged_to_parse(tagged)
|
||||
return tree
|
||||
|
||||
def _train(self, corpus):
|
||||
# Convert to tagged sequence
|
||||
corpus = [self._parse_to_tagged(s) for s in corpus]
|
||||
|
||||
self._tagger = NEChunkParserTagger(train=corpus)
|
||||
|
||||
def _tagged_to_parse(self, tagged_tokens):
|
||||
"""
|
||||
Convert a list of tagged tokens to a chunk-parse tree.
|
||||
"""
|
||||
sent = Tree("S", [])
|
||||
|
||||
for tok, tag in tagged_tokens:
|
||||
if tag == "O":
|
||||
sent.append(tok)
|
||||
elif tag.startswith("B-"):
|
||||
sent.append(Tree(tag[2:], [tok]))
|
||||
elif tag.startswith("I-"):
|
||||
if sent and isinstance(sent[-1], Tree) and sent[-1].label() == tag[2:]:
|
||||
sent[-1].append(tok)
|
||||
else:
|
||||
sent.append(Tree(tag[2:], [tok]))
|
||||
return sent
|
||||
|
||||
@staticmethod
|
||||
def _parse_to_tagged(sent):
|
||||
"""
|
||||
Convert a chunk-parse tree to a list of tagged tokens.
|
||||
"""
|
||||
toks = []
|
||||
for child in sent:
|
||||
if isinstance(child, Tree):
|
||||
if len(child) == 0:
|
||||
print("Warning -- empty chunk in sentence")
|
||||
continue
|
||||
toks.append((child[0], f"B-{child.label()}"))
|
||||
for tok in child[1:]:
|
||||
toks.append((tok, f"I-{child.label()}"))
|
||||
else:
|
||||
toks.append((child, "O"))
|
||||
return toks
|
||||


def shape(word):
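    """
    Map a word to a coarse orthographic shape, used as a feature by the
    classifier-based tagger above. A few illustrative cases (a small
    sketch of the behaviour)::

        >>> shape("Paris"), shape("dog"), shape("3.14"), shape(",")
        ('upcase', 'downcase', 'number', 'punct')
    """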
    if re.match(r"[0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+$", word, re.UNICODE):
        return "number"
    elif re.match(r"\W+$", word, re.UNICODE):
        return "punct"
    elif re.match(r"\w+$", word, re.UNICODE):
        if word.istitle():
            return "upcase"
        elif word.islower():
            return "downcase"
        else:
            return "mixedcase"
    else:
        return "other"


def simplify_pos(s):
    if s.startswith("V"):
        return "V"
    else:
        return s.split("-")[0]

def postag_tree(tree):
|
||||
# Part-of-speech tagging.
|
||||
words = tree.leaves()
|
||||
tag_iter = (pos for (word, pos) in pos_tag(words))
|
||||
newtree = Tree("S", [])
|
||||
for child in tree:
|
||||
if isinstance(child, Tree):
|
||||
newtree.append(Tree(child.label(), []))
|
||||
for subchild in child:
|
||||
newtree[-1].append((subchild, next(tag_iter)))
|
||||
else:
|
||||
newtree.append((child, next(tag_iter)))
|
||||
return newtree
|
||||
|
||||
|
||||
def load_ace_data(roots, fmt="binary", skip_bnews=True):
|
||||
for root in roots:
|
||||
for root, dirs, files in os.walk(root):
|
||||
if root.endswith("bnews") and skip_bnews:
|
||||
continue
|
||||
for f in files:
|
||||
if f.endswith(".sgm"):
|
||||
yield from load_ace_file(os.path.join(root, f), fmt)
|
||||
|
||||
|
||||
def load_ace_file(textfile, fmt):
|
||||
print(f" - {os.path.split(textfile)[1]}")
|
||||
annfile = textfile + ".tmx.rdc.xml"
|
||||
|
||||
# Read the xml file, and get a list of entities
|
||||
entities = []
|
||||
with open(annfile) as infile:
|
||||
xml = ET.parse(infile).getroot()
|
||||
for entity in xml.findall("document/entity"):
|
||||
typ = entity.find("entity_type").text
|
||||
for mention in entity.findall("entity_mention"):
|
||||
if mention.get("TYPE") != "NAME":
|
||||
continue # only NEs
|
||||
s = int(mention.find("head/charseq/start").text)
|
||||
e = int(mention.find("head/charseq/end").text) + 1
|
||||
entities.append((s, e, typ))
|
||||
|
||||
# Read the text file, and mark the entities.
|
||||
with open(textfile) as infile:
|
||||
text = infile.read()
|
||||
|
||||
# Strip XML tags, since they don't count towards the indices
|
||||
text = re.sub("<(?!/?TEXT)[^>]+>", "", text)
|
||||
|
||||
# Blank out anything before/after <TEXT>
|
||||
def subfunc(m):
|
||||
return " " * (m.end() - m.start() - 6)
|
||||
|
||||
text = re.sub(r"[\s\S]*<TEXT>", subfunc, text)
|
||||
text = re.sub(r"</TEXT>[\s\S]*", "", text)
|
||||
|
||||
# Simplify quotes
|
||||
text = re.sub("``", ' "', text)
|
||||
text = re.sub("''", '" ', text)
|
||||
|
||||
entity_types = {typ for (s, e, typ) in entities}
|
||||
|
||||
# Binary distinction (NE or not NE)
|
||||
if fmt == "binary":
|
||||
i = 0
|
||||
toks = Tree("S", [])
|
||||
for s, e, typ in sorted(entities):
|
||||
if s < i:
|
||||
s = i # Overlapping! Deal with this better?
|
||||
if e <= s:
|
||||
continue
|
||||
toks.extend(word_tokenize(text[i:s]))
|
||||
toks.append(Tree("NE", text[s:e].split()))
|
||||
i = e
|
||||
toks.extend(word_tokenize(text[i:]))
|
||||
yield toks
|
||||
|
||||
# Multiclass distinction (NE type)
|
||||
elif fmt == "multiclass":
|
||||
i = 0
|
||||
toks = Tree("S", [])
|
||||
for s, e, typ in sorted(entities):
|
||||
if s < i:
|
||||
s = i # Overlapping! Deal with this better?
|
||||
if e <= s:
|
||||
continue
|
||||
toks.extend(word_tokenize(text[i:s]))
|
||||
toks.append(Tree(typ, text[s:e].split()))
|
||||
i = e
|
||||
toks.extend(word_tokenize(text[i:]))
|
||||
yield toks
|
||||
|
||||
else:
|
||||
raise ValueError("bad fmt value")
|
||||
|
||||
|
||||
# This probably belongs in a more general-purpose location (as does
# the parse_to_tagged function).
def cmp_chunks(correct, guessed):
    correct = NEChunkParser._parse_to_tagged(correct)
    guessed = NEChunkParser._parse_to_tagged(guessed)
    ellipsis = False
    for (w, ct), (w, gt) in zip(correct, guessed):
        if ct == gt == "O":
            if not ellipsis:
                print(f" {ct:15} {gt:15} {w}")
                print(" {:15} {:15} {}".format("...", "...", "..."))
                ellipsis = True
        else:
            ellipsis = False
            print(f" {ct:15} {gt:15} {w}")

# ======================================================================================


class Maxent_NE_Chunker(NEChunkParser):
    """
    Expected input: list of pos-tagged words
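
    A minimal sketch of loading and using the chunker (skipped here since
    it needs the ``maxent_ne_chunker_tab`` data package; the output labels
    shown are only illustrative)::

        >>> chunker = Maxent_NE_Chunker("multiclass")  # doctest: +SKIP
        >>> print(chunker.parse([("Paris", "NNP"), ("is", "VBZ"), ("nice", "JJ")]))  # doctest: +SKIP
        (S (GPE Paris/NNP) is/VBZ nice/JJ)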
"""
|
||||
|
||||
def __init__(self, fmt="multiclass"):
|
||||
from nltk.data import find
|
||||
|
||||
self._fmt = fmt
|
||||
self._tab_dir = find(f"chunkers/maxent_ne_chunker_tab/english_ace_{fmt}/")
|
||||
self.load_params()
|
||||
|
||||
def load_params(self):
|
||||
from nltk.classify.maxent import BinaryMaxentFeatureEncoding, load_maxent_params
|
||||
|
||||
wgt, mpg, lab, aon = load_maxent_params(self._tab_dir)
|
||||
mc = MaxentClassifier(
|
||||
BinaryMaxentFeatureEncoding(lab, mpg, alwayson_features=aon), wgt
|
||||
)
|
||||
self._tagger = NEChunkParserTagger(classifier=mc)
|
||||
|
||||
def save_params(self):
|
||||
from nltk.classify.maxent import save_maxent_params
|
||||
|
||||
classif = self._tagger._classifier
|
||||
ecg = classif._encoding
|
||||
wgt = classif._weights
|
||||
mpg = ecg._mapping
|
||||
lab = ecg._labels
|
||||
aon = ecg._alwayson
|
||||
fmt = self._fmt
|
||||
save_maxent_params(wgt, mpg, lab, aon, tab_dir=f"/tmp/english_ace_{fmt}/")
|
||||
|
||||
|
||||
def build_model(fmt="multiclass"):
|
||||
chunker = Maxent_NE_Chunker(fmt)
|
||||
chunker.save_params()
|
||||
return chunker
|
||||
|
||||
|
||||
# ======================================================================================
|
||||
|
||||
"""
|
||||
2024 update: pickles are not supported anymore.
|
||||
|
||||
Deprecated:
|
||||
|
||||
def build_model(fmt="binary"):
|
||||
print("Loading training data...")
|
||||
train_paths = [
|
||||
find("corpora/ace_data/ace.dev"),
|
||||
find("corpora/ace_data/ace.heldout"),
|
||||
find("corpora/ace_data/bbn.dev"),
|
||||
find("corpora/ace_data/muc.dev"),
|
||||
]
|
||||
train_trees = load_ace_data(train_paths, fmt)
|
||||
train_data = [postag_tree(t) for t in train_trees]
|
||||
print("Training...")
|
||||
cp = NEChunkParser(train_data)
|
||||
del train_data
|
||||
|
||||
print("Loading eval data...")
|
||||
eval_paths = [find("corpora/ace_data/ace.eval")]
|
||||
eval_trees = load_ace_data(eval_paths, fmt)
|
||||
eval_data = [postag_tree(t) for t in eval_trees]
|
||||
|
||||
print("Evaluating...")
|
||||
chunkscore = ChunkScore()
|
||||
for i, correct in enumerate(eval_data):
|
||||
guess = cp.parse(correct.leaves())
|
||||
chunkscore.score(correct, guess)
|
||||
if i < 3:
|
||||
cmp_chunks(correct, guess)
|
||||
print(chunkscore)
|
||||
|
||||
outfilename = f"/tmp/ne_chunker_{fmt}.pickle"
|
||||
print(f"Saving chunker to {outfilename}...")
|
||||
|
||||
with open(outfilename, "wb") as outfile:
|
||||
pickle.dump(cp, outfile, -1)
|
||||
|
||||
return cp
|
||||
"""
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Make sure that the object has the right class name:
|
||||
build_model("binary")
|
||||
build_model("multiclass")
|
||||
1474  Backend/venv/lib/python3.12/site-packages/nltk/chunk/regexp.py  Normal file
File diff suppressed because it is too large
642  Backend/venv/lib/python3.12/site-packages/nltk/chunk/util.py  Normal file
@@ -0,0 +1,642 @@
|
||||
# Natural Language Toolkit: Chunk format conversions
|
||||
#
|
||||
# Copyright (C) 2001-2025 NLTK Project
|
||||
# Author: Edward Loper <edloper@gmail.com>
|
||||
# Steven Bird <stevenbird1@gmail.com> (minor additions)
|
||||
# URL: <https://www.nltk.org/>
|
||||
# For license information, see LICENSE.TXT
|
||||
|
||||
import re
|
||||
|
||||
from nltk.metrics import accuracy as _accuracy
|
||||
from nltk.tag.mapping import map_tag
|
||||
from nltk.tag.util import str2tuple
|
||||
from nltk.tree import Tree
|
||||
|
||||
##//////////////////////////////////////////////////////
|
||||
## EVALUATION
|
||||
##//////////////////////////////////////////////////////
|
||||
|
||||
|
||||
def accuracy(chunker, gold):
|
||||
"""
|
||||
Score the accuracy of the chunker against the gold standard.
|
||||
Strip the chunk information from the gold standard and rechunk it using
|
||||
the chunker, then compute the accuracy score.
|
||||
|
||||
:type chunker: ChunkParserI
|
||||
:param chunker: The chunker being evaluated.
|
||||
    :type gold: list(Tree)
    :param gold: The chunk structures to score the chunker on.
    :rtype: float
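
    A typical call, assuming ``chunker`` is any trained ``ChunkParserI``
    implementation (skipped here because the chunker and the ``conll2000``
    corpus both require separate setup)::

        >>> from nltk.corpus import conll2000  # doctest: +SKIP
        >>> accuracy(chunker, conll2000.chunked_sents("test.txt")[:50])  # doctest: +SKIP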
"""
|
||||
|
||||
gold_tags = []
|
||||
test_tags = []
|
||||
for gold_tree in gold:
|
||||
test_tree = chunker.parse(gold_tree.flatten())
|
||||
gold_tags += tree2conlltags(gold_tree)
|
||||
test_tags += tree2conlltags(test_tree)
|
||||
|
||||
# print 'GOLD:', gold_tags[:50]
|
||||
# print 'TEST:', test_tags[:50]
|
||||
return _accuracy(gold_tags, test_tags)
|
||||
|
||||
|
||||
# Patched for increased performance by Yoav Goldberg <yoavg@cs.bgu.ac.il>, 2006-01-13
|
||||
# -- statistics are evaluated only on demand, instead of at every sentence evaluation
|
||||
#
|
||||
# SB: use nltk.metrics for precision/recall scoring?
|
||||
#
|
||||
class ChunkScore:
|
||||
"""
|
||||
A utility class for scoring chunk parsers. ``ChunkScore`` can
|
||||
evaluate a chunk parser's output, based on a number of statistics
|
||||
(precision, recall, f-measure, missed chunks, incorrect chunks).
|
||||
It can also combine the scores from the parsing of multiple texts;
|
||||
this makes it significantly easier to evaluate a chunk parser that
|
||||
operates one sentence at a time.
|
||||
|
||||
Texts are evaluated with the ``score`` method. The results of
|
||||
evaluation can be accessed via a number of accessor methods, such
|
||||
as ``precision`` and ``f_measure``. A typical use of the
|
||||
``ChunkScore`` class is::
|
||||
|
||||
>>> chunkscore = ChunkScore() # doctest: +SKIP
|
||||
>>> for correct in correct_sentences: # doctest: +SKIP
|
||||
... guess = chunkparser.parse(correct.leaves()) # doctest: +SKIP
|
||||
... chunkscore.score(correct, guess) # doctest: +SKIP
|
||||
>>> print('F Measure:', chunkscore.f_measure()) # doctest: +SKIP
|
||||
F Measure: 0.823
|
||||
|
||||
:ivar kwargs: Keyword arguments:
|
||||
|
||||
- max_tp_examples: The maximum number of actual examples of true
|
||||
positives to record. This affects the ``correct`` member
|
||||
function: ``correct`` will not return more than this number
|
||||
of true positive examples. This does *not* affect any of
|
||||
the numerical metrics (precision, recall, or f-measure)
|
||||
|
||||
- max_fp_examples: The maximum number of actual examples of false
|
||||
positives to record. This affects the ``incorrect`` member
|
||||
function and the ``guessed`` member function: ``incorrect``
|
||||
will not return more than this number of examples, and
|
||||
``guessed`` will not return more than this number of true
|
||||
positive examples. This does *not* affect any of the
|
||||
numerical metrics (precision, recall, or f-measure)
|
||||
|
||||
- max_fn_examples: The maximum number of actual examples of false
|
||||
negatives to record. This affects the ``missed`` member
|
||||
function and the ``correct`` member function: ``missed``
|
||||
will not return more than this number of examples, and
|
||||
``correct`` will not return more than this number of true
|
||||
negative examples. This does *not* affect any of the
|
||||
numerical metrics (precision, recall, or f-measure)
|
||||
|
||||
- chunk_label: A regular expression indicating which chunks
|
||||
should be compared. Defaults to ``'.*'`` (i.e., all chunks).
|
||||
|
||||
:type _tp: list(Token)
|
||||
:ivar _tp: List of true positives
|
||||
:type _fp: list(Token)
|
||||
:ivar _fp: List of false positives
|
||||
:type _fn: list(Token)
|
||||
:ivar _fn: List of false negatives
|
||||
|
||||
:type _tp_num: int
|
||||
:ivar _tp_num: Number of true positives
|
||||
:type _fp_num: int
|
||||
:ivar _fp_num: Number of false positives
|
||||
:type _fn_num: int
|
||||
:ivar _fn_num: Number of false negatives.
|
||||
"""
|
||||
|
||||
def __init__(self, **kwargs):
|
||||
self._correct = set()
|
||||
self._guessed = set()
|
||||
self._tp = set()
|
||||
self._fp = set()
|
||||
self._fn = set()
|
||||
self._max_tp = kwargs.get("max_tp_examples", 100)
|
||||
self._max_fp = kwargs.get("max_fp_examples", 100)
|
||||
self._max_fn = kwargs.get("max_fn_examples", 100)
|
||||
self._chunk_label = kwargs.get("chunk_label", ".*")
|
||||
self._tp_num = 0
|
||||
self._fp_num = 0
|
||||
self._fn_num = 0
|
||||
self._count = 0
|
||||
self._tags_correct = 0.0
|
||||
self._tags_total = 0.0
|
||||
|
||||
self._measuresNeedUpdate = False
|
||||
|
||||
def _updateMeasures(self):
|
||||
if self._measuresNeedUpdate:
|
||||
self._tp = self._guessed & self._correct
|
||||
self._fn = self._correct - self._guessed
|
||||
self._fp = self._guessed - self._correct
|
||||
self._tp_num = len(self._tp)
|
||||
self._fp_num = len(self._fp)
|
||||
self._fn_num = len(self._fn)
|
||||
self._measuresNeedUpdate = False
|
||||
|
||||
def score(self, correct, guessed):
|
||||
"""
|
||||
Given a correctly chunked sentence, score another chunked
|
||||
version of the same sentence.
|
||||
|
||||
:type correct: chunk structure
|
||||
:param correct: The known-correct ("gold standard") chunked
|
||||
sentence.
|
||||
:type guessed: chunk structure
|
||||
:param guessed: The chunked sentence to be scored.
|
||||
"""
|
||||
self._correct |= _chunksets(correct, self._count, self._chunk_label)
|
||||
self._guessed |= _chunksets(guessed, self._count, self._chunk_label)
|
||||
self._count += 1
|
||||
self._measuresNeedUpdate = True
|
||||
# Keep track of per-tag accuracy (if possible)
|
||||
try:
|
||||
correct_tags = tree2conlltags(correct)
|
||||
guessed_tags = tree2conlltags(guessed)
|
||||
except ValueError:
|
||||
# This exception case is for nested chunk structures,
|
||||
# where tree2conlltags will fail with a ValueError: "Tree
|
||||
# is too deeply nested to be printed in CoNLL format."
|
||||
correct_tags = guessed_tags = ()
|
||||
self._tags_total += len(correct_tags)
|
||||
self._tags_correct += sum(
|
||||
1 for (t, g) in zip(guessed_tags, correct_tags) if t == g
|
||||
)
|
||||
|
||||
def accuracy(self):
|
||||
"""
|
||||
Return the overall tag-based accuracy for all texts that have
|
||||
been scored by this ``ChunkScore``, using the IOB (conll2000)
|
||||
tag encoding.
|
||||
|
||||
:rtype: float
|
||||
"""
|
||||
if self._tags_total == 0:
|
||||
return 1
|
||||
return self._tags_correct / self._tags_total
|
||||
|
||||
def precision(self):
|
||||
"""
|
||||
Return the overall precision for all texts that have been
|
||||
scored by this ``ChunkScore``.
|
||||
|
||||
:rtype: float
|
||||
"""
|
||||
self._updateMeasures()
|
||||
div = self._tp_num + self._fp_num
|
||||
if div == 0:
|
||||
return 0
|
||||
else:
|
||||
return self._tp_num / div
|
||||
|
||||
def recall(self):
|
||||
"""
|
||||
Return the overall recall for all texts that have been
|
||||
scored by this ``ChunkScore``.
|
||||
|
||||
:rtype: float
|
||||
"""
|
||||
self._updateMeasures()
|
||||
div = self._tp_num + self._fn_num
|
||||
if div == 0:
|
||||
return 0
|
||||
else:
|
||||
return self._tp_num / div
|
||||
|
||||
def f_measure(self, alpha=0.5):
|
||||
"""
|
||||
Return the overall F measure for all texts that have been
|
||||
scored by this ``ChunkScore``.
|
||||
|
||||
:param alpha: the relative weighting of precision and recall.
|
||||
Larger alpha biases the score towards the precision value,
|
||||
while smaller alpha biases the score towards the recall
|
||||
value. ``alpha`` should have a value in the range [0,1].
|
||||
:type alpha: float
|
||||
:rtype: float
|
||||
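
        For example, with precision 0.8, recall 0.6, and the default
        ``alpha=0.5``, the score is the harmonic mean:
        ``1 / (0.5/0.8 + 0.5/0.6) = 0.6857...``.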
"""
|
||||
self._updateMeasures()
|
||||
p = self.precision()
|
||||
r = self.recall()
|
||||
if p == 0 or r == 0: # what if alpha is 0 or 1?
|
||||
return 0
|
||||
return 1 / (alpha / p + (1 - alpha) / r)
|
||||
|
||||
def missed(self):
|
||||
"""
|
||||
Return the chunks which were included in the
|
||||
correct chunk structures, but not in the guessed chunk
|
||||
structures, listed in input order.
|
||||
|
||||
:rtype: list of chunks
|
||||
"""
|
||||
self._updateMeasures()
|
||||
chunks = list(self._fn)
|
||||
return [c[1] for c in chunks] # discard position information
|
||||
|
||||
def incorrect(self):
|
||||
"""
|
||||
Return the chunks which were included in the guessed chunk structures,
|
||||
but not in the correct chunk structures, listed in input order.
|
||||
|
||||
:rtype: list of chunks
|
||||
"""
|
||||
self._updateMeasures()
|
||||
chunks = list(self._fp)
|
||||
return [c[1] for c in chunks] # discard position information
|
||||
|
||||
def correct(self):
|
||||
"""
|
||||
Return the chunks which were included in the correct
|
||||
chunk structures, listed in input order.
|
||||
|
||||
:rtype: list of chunks
|
||||
"""
|
||||
chunks = list(self._correct)
|
||||
return [c[1] for c in chunks] # discard position information
|
||||
|
||||
def guessed(self):
|
||||
"""
|
||||
Return the chunks which were included in the guessed
|
||||
chunk structures, listed in input order.
|
||||
|
||||
:rtype: list of chunks
|
||||
"""
|
||||
chunks = list(self._guessed)
|
||||
return [c[1] for c in chunks] # discard position information
|
||||
|
||||
def __len__(self):
|
||||
self._updateMeasures()
|
||||
return self._tp_num + self._fn_num
|
||||
|
||||
def __repr__(self):
|
||||
"""
|
||||
Return a concise representation of this ``ChunkScoring``.
|
||||
|
||||
:rtype: str
|
||||
"""
|
||||
return "<ChunkScoring of " + repr(len(self)) + " chunks>"
|
||||
|
||||
def __str__(self):
|
||||
"""
|
||||
Return a verbose representation of this ``ChunkScoring``.
|
||||
This representation includes the precision, recall, and
|
||||
f-measure scores. For other information about the score,
|
||||
use the accessor methods (e.g., ``missed()`` and ``incorrect()``).
|
||||
|
||||
:rtype: str
|
||||
"""
|
||||
return (
|
||||
"ChunkParse score:\n"
|
||||
+ f" IOB Accuracy: {self.accuracy() * 100:5.1f}%\n"
|
||||
+ f" Precision: {self.precision() * 100:5.1f}%\n"
|
||||
+ f" Recall: {self.recall() * 100:5.1f}%\n"
|
||||
+ f" F-Measure: {self.f_measure() * 100:5.1f}%"
|
||||
)
|
||||
|
||||
|
||||
# extract chunks, and assign unique id, the absolute position of
|
||||
# the first word of the chunk
|
||||
def _chunksets(t, count, chunk_label):
|
||||
pos = 0
|
||||
chunks = []
|
||||
for child in t:
|
||||
if isinstance(child, Tree):
|
||||
if re.match(chunk_label, child.label()):
|
||||
chunks.append(((count, pos), child.freeze()))
|
||||
pos += len(child.leaves())
|
||||
else:
|
||||
pos += 1
|
||||
return set(chunks)
|
||||
|
||||
|
||||
def tagstr2tree(
|
||||
s, chunk_label="NP", root_label="S", sep="/", source_tagset=None, target_tagset=None
|
||||
):
|
||||
"""
|
||||
Divide a string of bracketed tagged text into
|
||||
chunks and unchunked tokens, and produce a Tree.
|
||||
Chunks are marked by square brackets (``[...]``). Words are
|
||||
delimited by whitespace, and each word should have the form
|
||||
``text/tag``. Words that do not contain a slash are
|
||||
assigned a ``tag`` of None.
|
||||
|
||||
:param s: The string to be converted
|
||||
:type s: str
|
||||
:param chunk_label: The label to use for chunk nodes
|
||||
:type chunk_label: str
|
||||
:param root_label: The label to use for the root of the tree
|
||||
:type root_label: str
|
||||
:rtype: Tree
|
||||
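
    A short, self-contained sketch::

        >>> from nltk.chunk.util import tagstr2tree
        >>> print(tagstr2tree("[ the/DT dog/NN ] barked/VBD", chunk_label="NP"))
        (S (NP the/DT dog/NN) barked/VBD)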
"""
|
||||
|
||||
WORD_OR_BRACKET = re.compile(r"\[|\]|[^\[\]\s]+")
|
||||
|
||||
stack = [Tree(root_label, [])]
|
||||
for match in WORD_OR_BRACKET.finditer(s):
|
||||
text = match.group()
|
||||
if text[0] == "[":
|
||||
if len(stack) != 1:
|
||||
raise ValueError(f"Unexpected [ at char {match.start():d}")
|
||||
chunk = Tree(chunk_label, [])
|
||||
stack[-1].append(chunk)
|
||||
stack.append(chunk)
|
||||
elif text[0] == "]":
|
||||
if len(stack) != 2:
|
||||
raise ValueError(f"Unexpected ] at char {match.start():d}")
|
||||
stack.pop()
|
||||
else:
|
||||
if sep is None:
|
||||
stack[-1].append(text)
|
||||
else:
|
||||
word, tag = str2tuple(text, sep)
|
||||
if source_tagset and target_tagset:
|
||||
tag = map_tag(source_tagset, target_tagset, tag)
|
||||
stack[-1].append((word, tag))
|
||||
|
||||
if len(stack) != 1:
|
||||
raise ValueError(f"Expected ] at char {len(s):d}")
|
||||
return stack[0]
|
||||
|
||||
|
||||
### CONLL
|
||||
|
||||
_LINE_RE = re.compile(r"(\S+)\s+(\S+)\s+([IOB])-?(\S+)?")
|
||||
|
||||
|
||||
def conllstr2tree(s, chunk_types=("NP", "PP", "VP"), root_label="S"):
|
||||
"""
|
||||
Return a chunk structure for a single sentence
|
||||
encoded in the given CONLL 2000 style string.
|
||||
This function converts a CoNLL IOB string into a tree.
|
||||
It uses the specified chunk types
|
||||
(defaults to NP, PP and VP), and creates a tree rooted at a node
|
||||
labeled S (by default).
|
||||
|
||||
:param s: The CoNLL string to be converted.
|
||||
:type s: str
|
||||
:param chunk_types: The chunk types to be converted.
|
||||
:type chunk_types: tuple
|
||||
:param root_label: The node label to use for the root.
|
||||
:type root_label: str
|
||||
:rtype: Tree
|
||||
"""
|
||||
|
||||
stack = [Tree(root_label, [])]
|
||||
|
||||
for lineno, line in enumerate(s.split("\n")):
|
||||
if not line.strip():
|
||||
continue
|
||||
|
||||
# Decode the line.
|
||||
match = _LINE_RE.match(line)
|
||||
if match is None:
|
||||
raise ValueError(f"Error on line {lineno:d}")
|
||||
(word, tag, state, chunk_type) = match.groups()
|
||||
|
||||
# If it's a chunk type we don't care about, treat it as O.
|
||||
if chunk_types is not None and chunk_type not in chunk_types:
|
||||
state = "O"
|
||||
|
||||
# For "Begin"/"Outside", finish any completed chunks -
|
||||
# also do so for "Inside" which don't match the previous token.
|
||||
mismatch_I = state == "I" and chunk_type != stack[-1].label()
|
||||
if state in "BO" or mismatch_I:
|
||||
if len(stack) == 2:
|
||||
stack.pop()
|
||||
|
||||
# For "Begin", start a new chunk.
|
||||
if state == "B" or mismatch_I:
|
||||
chunk = Tree(chunk_type, [])
|
||||
stack[-1].append(chunk)
|
||||
stack.append(chunk)
|
||||
|
||||
# Add the new word token.
|
||||
stack[-1].append((word, tag))
|
||||
|
||||
return stack[0]
|
||||
|
||||
|
||||
def tree2conlltags(t):
|
||||
"""
|
||||
Return a list of 3-tuples containing ``(word, tag, IOB-tag)``.
|
||||
Convert a tree to the CoNLL IOB tag format.
|
||||
|
||||
:param t: The tree to be converted.
|
||||
:type t: Tree
|
||||
:rtype: list(tuple)
|
||||
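
    A short example::

        >>> from nltk.tree import Tree
        >>> t = Tree("S", [Tree("NP", [("the", "DT"), ("dog", "NN")]), ("barked", "VBD")])
        >>> tree2conlltags(t)
        [('the', 'DT', 'B-NP'), ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O')]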
"""
|
||||
|
||||
tags = []
|
||||
for child in t:
|
||||
try:
|
||||
category = child.label()
|
||||
prefix = "B-"
|
||||
for contents in child:
|
||||
if isinstance(contents, Tree):
|
||||
raise ValueError(
|
||||
"Tree is too deeply nested to be printed in CoNLL format"
|
||||
)
|
||||
tags.append((contents[0], contents[1], prefix + category))
|
||||
prefix = "I-"
|
||||
except AttributeError:
|
||||
tags.append((child[0], child[1], "O"))
|
||||
return tags
|
||||
|
||||
|
||||
def conlltags2tree(
|
||||
sentence, chunk_types=("NP", "PP", "VP"), root_label="S", strict=False
|
||||
):
|
||||
"""
|
||||
Convert the CoNLL IOB format to a tree.
|
||||
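
    The inverse of ``tree2conlltags``, for the common IOB cases::

        >>> tags = [("the", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")]
        >>> print(conlltags2tree(tags))
        (S (NP the/DT dog/NN) barked/VBD)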
"""
|
||||
tree = Tree(root_label, [])
|
||||
for word, postag, chunktag in sentence:
|
||||
if chunktag is None:
|
||||
if strict:
|
||||
raise ValueError("Bad conll tag sequence")
|
||||
else:
|
||||
# Treat as O
|
||||
tree.append((word, postag))
|
||||
elif chunktag.startswith("B-"):
|
||||
tree.append(Tree(chunktag[2:], [(word, postag)]))
|
||||
elif chunktag.startswith("I-"):
|
||||
if (
|
||||
len(tree) == 0
|
||||
or not isinstance(tree[-1], Tree)
|
||||
or tree[-1].label() != chunktag[2:]
|
||||
):
|
||||
if strict:
|
||||
raise ValueError("Bad conll tag sequence")
|
||||
else:
|
||||
# Treat as B-*
|
||||
tree.append(Tree(chunktag[2:], [(word, postag)]))
|
||||
else:
|
||||
tree[-1].append((word, postag))
|
||||
elif chunktag == "O":
|
||||
tree.append((word, postag))
|
||||
else:
|
||||
raise ValueError(f"Bad conll tag {chunktag!r}")
|
||||
return tree
|
||||
|
||||
|
||||
def tree2conllstr(t):
|
||||
"""
|
||||
Return a multiline string where each line contains a word, tag and IOB tag.
|
||||
Convert a tree to the CoNLL IOB string format
|
||||
|
||||
:param t: The tree to be converted.
|
||||
:type t: Tree
|
||||
:rtype: str
|
||||
"""
|
||||
lines = [" ".join(token) for token in tree2conlltags(t)]
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
### IEER
|
||||
|
||||
_IEER_DOC_RE = re.compile(
|
||||
r"<DOC>\s*"
|
||||
r"(<DOCNO>\s*(?P<docno>.+?)\s*</DOCNO>\s*)?"
|
||||
r"(<DOCTYPE>\s*(?P<doctype>.+?)\s*</DOCTYPE>\s*)?"
|
||||
r"(<DATE_TIME>\s*(?P<date_time>.+?)\s*</DATE_TIME>\s*)?"
|
||||
r"<BODY>\s*"
|
||||
r"(<HEADLINE>\s*(?P<headline>.+?)\s*</HEADLINE>\s*)?"
|
||||
r"<TEXT>(?P<text>.*?)</TEXT>\s*"
|
||||
r"</BODY>\s*</DOC>\s*",
|
||||
re.DOTALL,
|
||||
)
|
||||
|
||||
_IEER_TYPE_RE = re.compile(r'<b_\w+\s+[^>]*?type="(?P<type>\w+)"')
|
||||
|
||||
|
||||
def _ieer_read_text(s, root_label):
|
||||
stack = [Tree(root_label, [])]
|
||||
# s will be None if there is no headline in the text
|
||||
# return the empty list in place of a Tree
|
||||
if s is None:
|
||||
return []
|
||||
for piece_m in re.finditer(r"<[^>]+>|[^\s<]+", s):
|
||||
piece = piece_m.group()
|
||||
try:
|
||||
if piece.startswith("<b_"):
|
||||
m = _IEER_TYPE_RE.match(piece)
|
||||
if m is None:
|
||||
print("XXXX", piece)
|
||||
chunk = Tree(m.group("type"), [])
|
||||
stack[-1].append(chunk)
|
||||
stack.append(chunk)
|
||||
elif piece.startswith("<e_"):
|
||||
stack.pop()
|
||||
# elif piece.startswith('<'):
|
||||
# print "ERROR:", piece
|
||||
# raise ValueError # Unexpected HTML
|
||||
else:
|
||||
stack[-1].append(piece)
|
||||
except (IndexError, ValueError) as e:
|
||||
raise ValueError(
|
||||
f"Bad IEER string (error at character {piece_m.start():d})"
|
||||
) from e
|
||||
if len(stack) != 1:
|
||||
raise ValueError("Bad IEER string")
|
||||
return stack[0]
|
||||
|
||||
|
||||
def ieerstr2tree(
|
||||
s,
|
||||
chunk_types=[
|
||||
"LOCATION",
|
||||
"ORGANIZATION",
|
||||
"PERSON",
|
||||
"DURATION",
|
||||
"DATE",
|
||||
"CARDINAL",
|
||||
"PERCENT",
|
||||
"MONEY",
|
||||
"MEASURE",
|
||||
],
|
||||
root_label="S",
|
||||
):
|
||||
"""
|
||||
Return a chunk structure containing the chunked tagged text that is
|
||||
encoded in the given IEER style string.
|
||||
Convert a string of chunked tagged text in the IEER named
|
||||
entity format into a chunk structure. Chunks are of several
|
||||
types, LOCATION, ORGANIZATION, PERSON, DURATION, DATE, CARDINAL,
|
||||
PERCENT, MONEY, and MEASURE.
|
||||
|
||||
:rtype: Tree
|
||||
"""
|
||||
|
||||
# Try looking for a single document. If that doesn't work, then just
|
||||
# treat everything as if it was within the <TEXT>...</TEXT>.
|
||||
m = _IEER_DOC_RE.match(s)
|
||||
if m:
|
||||
return {
|
||||
"text": _ieer_read_text(m.group("text"), root_label),
|
||||
"docno": m.group("docno"),
|
||||
"doctype": m.group("doctype"),
|
||||
"date_time": m.group("date_time"),
|
||||
#'headline': m.group('headline')
|
||||
# we want to capture NEs in the headline too!
|
||||
"headline": _ieer_read_text(m.group("headline"), root_label),
|
||||
}
|
||||
else:
|
||||
return _ieer_read_text(s, root_label)
|
||||
|
||||
|
||||
def demo():
|
||||
s = "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] ./."
|
||||
import nltk
|
||||
|
||||
t = nltk.chunk.tagstr2tree(s, chunk_label="NP")
|
||||
t.pprint()
|
||||
print()
|
||||
|
||||
s = """
|
||||
These DT B-NP
|
||||
research NN I-NP
|
||||
protocols NNS I-NP
|
||||
offer VBP B-VP
|
||||
to TO B-PP
|
||||
the DT B-NP
|
||||
patient NN I-NP
|
||||
not RB O
|
||||
only RB O
|
||||
the DT B-NP
|
||||
very RB I-NP
|
||||
best JJS I-NP
|
||||
therapy NN I-NP
|
||||
which WDT B-NP
|
||||
we PRP B-NP
|
||||
have VBP B-VP
|
||||
established VBN I-VP
|
||||
today NN B-NP
|
||||
but CC B-NP
|
||||
also RB I-NP
|
||||
the DT B-NP
|
||||
hope NN I-NP
|
||||
of IN B-PP
|
||||
something NN B-NP
|
||||
still RB B-ADJP
|
||||
better JJR I-ADJP
|
||||
. . O
|
||||
"""
|
||||
|
||||
conll_tree = conllstr2tree(s, chunk_types=("NP", "PP"))
|
||||
conll_tree.pprint()
|
||||
|
||||
# Demonstrate CoNLL output
|
||||
print("CoNLL output:")
|
||||
print(nltk.chunk.tree2conllstr(conll_tree))
|
||||
print()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
demo()
|
||||