`segmentador.optimize.models`

Models with optimized format for inference.

Module Contents

Classes

`ONNXBERTSegmenter`	BERT segmenter in ONNX format.
`ONNXLSTMSegmenter`	LSTM segmenter in ONNX format.
`TorchJITBERTSegmenter`	BERT segmenter in Torch JIT format.
`TorchJITLSTMSegmenter`	LSTM segmenter in Torch JIT format.

class segmentador.optimize.models.ONNXBERTSegmenter(uri_model: str, uri_tokenizer: str, inference_pooling_operation: str = 'sum', local_files_only: bool = True, cache_dir_model: str = './cache/models', cache_dir_tokenizer: str = './cache/tokenizers', uri_model_extension: str = '.onnx')

Bases: segmentador._base.BaseSegmenter

BERT segmenter in ONNX format.

The ONNX format supports faster inference, as well as quantized and optimized models with hardware-specific instructions.

Uses a pretrained Transformer Encoder to segment Brazilian Portuguese legal texts. The pretrained models support texts up to 1024 subwords. Texts larger than this value are pre-segmented into 1024 subword blocks, and each block is fed to the segmenter individually.

Parameters

uri_model (str) – URI to load pretrained model from. If local_files_only=True, then it must be a local file.
uri_tokenizer (str) – URI to pretrained text Tokenizer.
uri_onnx_config (str) – URI to pickled ONNX configuration.
inference_pooling_operation ({'max', 'sum', 'gaussian', 'assymetric-max'}, default='sum') –
Specify the strategy used to combine logits during model inference for documents larger than 1024 subword tokens. Larger documents are sharded into possibly overlapping windows of 1024 subwords each. Thus, a single token may have multiple logits (and, therefore, predictions) associated with it. This argument defines exactly how the logits should be combined in order to derive the final verdict for that token. The possible choices for this argument are:
- max: take the maximum logit of each token;
- sum: sum the logits associated with the same token;
- gaussian: build a Gaussian filter that gives higher weight to logits closer to the window center and lower weight to logits closer to the window limits; and
- assymetric-max: take the maximum logit of each token for all classes other than the No-operation class, which in turn receives the minimum among all corresponding logits instead.
local_files_only (bool, default=True) – If True, will search only for local pretrained model and tokenizers. If False, may download models from Huggingface HUB, if necessary.
cache_dir_model (str, default='./cache/models') – Cache directory for transformer encoder model.
cache_dir_tokenizer (str, default='./cache/tokenizers') – Cache directory for text tokenizer.
uri_model_extension (str, default='.onnx') – Expected file extension of the local model file. If uri_model does not end with the provided extension, it will be appended to the end of the URI before loading the model.
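To make the inference_pooling_operation choices more concrete, the sketch below combines the logits of tokens covered by more than one window. It works on plain Python lists rather than the library's arrays. The function name `pool_logits`, the input layout, and the choice of index 0 for the No-operation class are assumptions made for this illustration and are not taken from segmentador. The gaussian strategy is left out.

```python
from typing import List, Sequence, Tuple


def pool_logits(
    window_logits: Sequence[Tuple[int, List[List[float]]]],
    num_tokens: int,
    operation: str = "sum",
    noop_class: int = 0,  # assumption: index of the No-operation class
) -> List[List[float]]:
    """Combine logits of overlapping windows into one logit row per token.

    Each item of window_logits is (start_offset, rows), where rows[i] holds
    the class logits predicted for the token at position start_offset + i.
    Every token is assumed to be covered by at least one window.
    """
    # Gather every logit row predicted for each token position.
    per_token: List[List[List[float]]] = [[] for _ in range(num_tokens)]
    for start, rows in window_logits:
        for i, row in enumerate(rows):
            per_token[start + i].append(row)

    pooled = []
    for candidates in per_token:
        columns = list(zip(*candidates))  # one tuple of logits per class
        if operation == "sum":
            row = [sum(col) for col in columns]
        elif operation == "max":
            row = [max(col) for col in columns]
        elif operation == "assymetric-max":
            # Minimum for the No-operation class, maximum for the others.
            row = [
                min(col) if cls == noop_class else max(col)
                for cls, col in enumerate(columns)
            ]
        else:
            raise ValueError(f"unsupported operation: {operation}")
        pooled.append(row)
    return pooled
```

With two windows that overlap on a single token, only that token's row differs between strategies; tokens seen by one window keep their logits unchanged.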

NUM_CLASSES = 4

eval(self) → ONNXBERTSegmenter: No-op method, created only to keep API consistent.

train(self) → ONNXBERTSegmenter: No-op method, created only to keep API consistent.

__call__(self, *args: Any, **kwargs: Any) → Union[List[str], Tuple[List[Any], Ellipsis]]

__repr__(self) → str: Return repr(self).

to(self, device: Union[str, torch.device]) → BaseSegmenter: Move underlying model to device.

property model(self) → Union[torch.nn.Module, transformers.BertForTokenClassification]

property tokenizer(self) → transformers.BertTokenizerFast

property RE_JUSTIFICATIVA(self) → regex.Pattern: Regular expression used to detect ‘justificativa’ blocks.

classmethod preprocess_legal_text(cls, text: str, return_justificativa: bool = False, regex_justificativa: Optional[Union[str, regex.Pattern]] = None) → Union[str, Tuple[str, List[str]]]

Apply minimal legal text preprocessing.

The preprocessing steps are: 1. Coalesce all blank spaces in text; 2. Remove all trailing and leading blank spaces; and 3. Pre-segment text into legal text content and justificativa.
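The three steps can be sketched as follows. This is a minimal illustration that uses the standard `re` module instead of `regex`; the pattern `EXAMPLE_RE_JUSTIFICATIVA` and the function name `preprocess_sketch` are hypothetical, and the library's actual RE_JUSTIFICATIVA pattern is not reproduced here.

```python
import re
from typing import List, Tuple

# Hypothetical pattern, for this example only.
EXAMPLE_RE_JUSTIFICATIVA = re.compile(r"\bJUSTIFICATIVA\b")


def preprocess_sketch(text: str) -> Tuple[str, List[str]]:
    # Steps 1 and 2: coalesce whitespace runs, then strip both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Step 3: split the text into legal content and justificativa blocks.
    content, *justificativa = EXAMPLE_RE_JUSTIFICATIVA.split(text)
    return content.strip(), [block.strip() for block in justificativa]
```

A text without any match for the pattern comes back as the content alone, with an empty list of justificativa blocks.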

Parameters

text (str) – Text to be preprocessed.
return_justificativa (bool, default=False) – If True, return a tuple in the format (content, justificativa). If False, return only content.
regex_justificativa (str, regex.Pattern or None, default=None) – Regular expression specifying how the justificativa portion from legal documents should be detected. If None, will use the pattern predefined in Segmenter.RE_JUSTIFICATIVA class attribute.

Returns

preprocessed_text (str) – Content from text after the preprocessing steps.
justificativa_block (t.List[str]) – Detected legal text justificativa blocks. Only returned if return_justificativa=True.

generate_segments_from_ids(self, input_ids: Union[Sequence[int], numpy.typing.NDArray[numpy.int64]], label_ids: Union[Sequence[int], numpy.typing.NDArray[numpy.int64]], apply_postprocessing: bool = True) → List[str]

Generate segments from ids and labels.

Parameters

input_ids (t.Sequence[int] or npt.NDArray[np.int64]) – Tokenized text from model’s tokenizer.
label_ids (t.Sequence[int] or npt.NDArray[np.int64]) – Label ids for each token, where ‘label_id=1’ denotes the start of a new segment.
apply_postprocessing (bool, default=True) – If True, remove spurious whitespaces next to punctuation marks in the output.

Returns

segments – List containing all segments in textual form.

Return type

t.List[str]
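The role of label_ids can be illustrated with a small sketch that groups tokens into segments, opening a new segment at every token labeled 1. For simplicity it takes already decoded token strings instead of input_ids and joins them with spaces; the function name `split_by_labels` is an assumption of this example, and the real method decodes ids through the model's tokenizer.

```python
from typing import List, Sequence


def split_by_labels(tokens: Sequence[str], label_ids: Sequence[int]) -> List[str]:
    segments: List[List[str]] = [[]]
    for token, label in zip(tokens, label_ids):
        # label_id=1 marks the first token of a new segment.
        if label == 1 and segments[-1]:
            segments.append([])
        segments[-1].append(token)
    # Drop the initial placeholder if the input was empty.
    return [" ".join(segment) for segment in segments if segment]
```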

segment_legal_text(self, text: Union[str, Dict[str, List[int]]], batch_size: int = 32, moving_window_size: int = 512, window_shift_size: Union[float, int] = 0.25, return_justificativa: bool = False, return_labels: bool = False, return_logits: bool = False, remove_noise_subsegments: bool = False, maximum_noise_subsegment_length: int = 25, apply_postprocessing: bool = True, show_progress_bar: bool = False, regex_justificativa: Optional[Union[str, regex.Pattern]] = None) → Union[List[str], Tuple[List[Any], Ellipsis]]

Segment legal text.

The pretrained model supports texts up to 1024 subwords. Texts larger than this value are pre-segmented into 1024 subword blocks, and each block is fed to the segmenter individually.

The block size can be configured to smaller (not larger) values using the moving_window_size argument of the BERTSegmenter.segment_legal_text method during inference.

Parameters

text (str or t.Dict[str, t.List[int]]) – Legal text to be segmented.
batch_size (int, default=32) – Maximum number of document blocks fed to the model in parallel. Higher values lead to faster inference at a higher memory cost.
moving_window_size (int, default=512) – Moving window size, the maximum number of subwords fed simultaneously to the segmenter model. Higher values lead to larger contexts for each token, at the expense of higher memory usage.
window_shift_size (int or float, default=0.25) –
Moving window shift size.
- If integer, specify the shift size per step exactly, and it must be in the [1, 1024] range.
- If float, the shift size is calculated as window_shift_size * moving_window_size (rounded up), and it must be in the (0.0, 1.0] range.
Overlapping logits are combined using the strategy specified by the argument inference_pooling_operation in Segmenter model initialization.

The final prediction for each token is derived from the combined logits.
return_justificativa (bool, default=False) – If True, return contents from the ‘justificativa’ block from document.
return_labels (bool, default=False) – If True, return label list for each token.
return_logits (bool, default=False) – If True, return logit array for each token.
remove_noise_subsegments (bool, default=False) –
If True, remove all tokens between tokens classified as noise_start (inclusive) and noise_end or segment (either exclusive), whichever occurs first.
- Tokens classified as noise_end are kept. In other words, they are the first non-noise token past the previous noise subsegment.
- Tokens between noise_start and the sentence end are also removed.
- Tokens between the sentence end and noise_end are kept.
- Only the closest noise_start for every noise_end (or the sentence end) is considered. In other words, redundant noise_start tokens are ignored.
maximum_noise_subsegment_length (int, default=25) – Maximum length (in tokens) allowed for each noise subsegment in order to be removed. Larger noise subsegments are kept intact. This argument is useful to prevent removing larger chunks of text that might actually contain useful information.
apply_postprocessing (bool, default=True) – If True, remove spurious whitespaces next to punctuation marks in the output.
show_progress_bar (bool, default=False) – If True, show segmentation progress bar.
regex_justificativa (str, regex.Pattern or None, default=None) – Regular expression specifying how the justificativa portion from legal documents should be detected. If None, will use the pattern predefined in Segmenter.RE_JUSTIFICATIVA class attribute.
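The window_shift_size rule described above (exact value for integers, a rounded-up fraction of moving_window_size for floats) can be sketched as below. The helper names `resolve_shift` and `window_starts` are hypothetical, and the way `window_starts` enumerates window offsets is an assumption made for illustration, not a description of the library's internals.

```python
import math
from typing import List, Union


def resolve_shift(window_shift_size: Union[int, float], moving_window_size: int) -> int:
    """Return the shift size in subwords, following the documented rule."""
    if isinstance(window_shift_size, float):
        if not 0.0 < window_shift_size <= 1.0:
            raise ValueError("float shift must be in the (0.0, 1.0] range")
        # Fraction of the window size, rounded up.
        return math.ceil(window_shift_size * moving_window_size)
    if not 1 <= window_shift_size <= 1024:
        raise ValueError("integer shift must be in the [1, 1024] range")
    return window_shift_size


def window_starts(num_tokens: int, moving_window_size: int, shift: int) -> List[int]:
    """Assumed enumeration: slide by `shift` until a window reaches the end."""
    starts = [0]
    while starts[-1] + moving_window_size < num_tokens:
        starts.append(starts[-1] + shift)
    return starts
```

With the defaults (moving_window_size=512, window_shift_size=0.25) the shift is 128 subwords, so under this enumeration a token in the middle of a long document is covered by up to four windows.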

Returns

segments (t.List[str]) – Segmented legal text.
justificativa (t.List[str]) – Detected legal text justificativa blocks. Only returned if return_justificativa=True.
labels (npt.NDArray[np.int32] of shape (N,)) – Predicted labels for each token, where N is the length of the tokenized document (in subword units). The -100 label is a special label, ignored while computing the loss function during training. Only returned if return_labels=True.
logits (npt.NDArray[np.float64] of shape (N, C)) – Predicted logits for each token, where N is the length of tokenized document (in subword units), and C is equal to the Segmenter.NUM_CLASSES attribute. Only returned if return_logits=True.
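The remove_noise_subsegments and maximum_noise_subsegment_length rules described in the parameters above can be sketched as follows. The numeric label ids (1 for segment, 2 for noise_start, 3 for noise_end) and the function name `remove_noise_sketch` are assumptions of this example; only the removal rules themselves come from the documentation.

```python
from typing import List, Optional, Sequence

# Assumed label ids, for illustration only.
SEGMENT, NOISE_START, NOISE_END = 1, 2, 3


def remove_noise_sketch(
    tokens: Sequence[str], labels: Sequence[int], max_len: int = 25
) -> List[str]:
    keep = [True] * len(tokens)
    start: Optional[int] = None
    for i, label in enumerate(labels):
        if label == NOISE_START:
            # A later noise_start replaces an earlier one (closest wins).
            start = i
        elif label in (NOISE_END, SEGMENT) and start is not None:
            # Remove [start, i) unless the span is longer than allowed.
            if i - start <= max_len:
                for j in range(start, i):
                    keep[j] = False
            start = None
    # An unclosed noise_start runs until the end of the sequence.
    if start is not None and len(tokens) - start <= max_len:
        for j in range(start, len(tokens)):
            keep[j] = False
    return [token for token, kept in zip(tokens, keep) if kept]
```

Note that the token labeled noise_end (or segment) is itself kept, and that a noise_end with no preceding noise_start removes nothing.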

class segmentador.optimize.models.ONNXLSTMSegmenter(uri_model: str, uri_tokenizer: str, inference_pooling_operation: str = 'sum', local_files_only: bool = True, cache_dir_model: str = './cache/models', cache_dir_tokenizer: str = './cache/tokenizers', uri_model_extension: str = '.onnx')

Bases: segmentador._base.BaseSegmenter

LSTM segmenter in ONNX format.

The ONNX format supports faster inference, as well as quantized and optimized models with hardware-specific instructions.

Uses a pretrained Transformer Encoder to segment Brazilian Portuguese legal texts. The pretrained models support texts up to 1024 subwords. Texts larger than this value are pre-segmented into 1024 subword blocks, and each block is fed to the segmenter individually.

Parameters

uri_model (str) – URI to load pretrained model from. If local_files_only=True, then it must be a local file.
uri_tokenizer (str) – URI to pretrained text Tokenizer.
inference_pooling_operation ({'max', 'sum', 'gaussian', 'assymetric-max'}, default='sum') –
Specify the strategy used to combine logits during model inference for documents larger than 1024 subword tokens. Larger documents are sharded into possibly overlapping windows of 1024 subwords each. Thus, a single token may have multiple logits (and, therefore, predictions) associated with it. This argument defines exactly how the logits should be combined in order to derive the final verdict for that token. The possible choices for this argument are:
- max: take the maximum logit of each token;
- sum: sum the logits associated with the same token;
- gaussian: build a Gaussian filter that gives higher weight to logits closer to the window center and lower weight to logits closer to the window limits; and
- assymetric-max: take the maximum logit of each token for all classes other than the No-operation class, which in turn receives the minimum among all corresponding logits instead.
local_files_only (bool, default=True) – If True, will search only for local pretrained model and tokenizers. If False, may download models from Huggingface HUB, if necessary.
cache_dir_model (str, default='./cache/models') – Cache directory for transformer encoder model.
cache_dir_tokenizer (str, default='./cache/tokenizers') – Cache directory for text tokenizer.
uri_model_extension (str, default='.onnx') – Expected file extension of the local model file. If uri_model does not end with the provided extension, it will be appended to the end of the URI before loading the model.
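The uri_model_extension behavior amounts to a suffix check. A minimal sketch, where the helper name `with_extension` and the example paths are hypothetical:

```python
def with_extension(uri_model: str, uri_model_extension: str = ".onnx") -> str:
    # Append the extension only when the URI does not already end with it.
    if uri_model.endswith(uri_model_extension):
        return uri_model
    return uri_model + uri_model_extension
```

So passing a URI with or without the extension resolves to the same local file name.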

NUM_CLASSES = 4

eval(self) → ONNXLSTMSegmenter: No-op method, created only to keep API consistent.

train(self) → ONNXLSTMSegmenter: No-op method, created only to keep API consistent.

__call__(self, *args: Any, **kwargs: Any) → Union[List[str], Tuple[List[Any], Ellipsis]]

__repr__(self) → str: Return repr(self).

to(self, device: Union[str, torch.device]) → BaseSegmenter: Move underlying model to device.

property model(self) → Union[torch.nn.Module, transformers.BertForTokenClassification]

property tokenizer(self) → transformers.BertTokenizerFast

property RE_JUSTIFICATIVA(self) → regex.Pattern: Regular expression used to detect ‘justificativa’ blocks.

classmethod preprocess_legal_text(cls, text: str, return_justificativa: bool = False, regex_justificativa: Optional[Union[str, regex.Pattern]] = None) → Union[str, Tuple[str, List[str]]]

Apply minimal legal text preprocessing.

The preprocessing steps are: 1. Coalesce all blank spaces in text; 2. Remove all trailing and leading blank spaces; and 3. Pre-segment text into legal text content and justificativa.

Parameters

text (str) – Text to be preprocessed.
return_justificativa (bool, default=False) – If True, return a tuple in the format (content, justificativa). If False, return only content.
regex_justificativa (str, regex.Pattern or None, default=None) – Regular expression specifying how the justificativa portion from legal documents should be detected. If None, will use the pattern predefined in Segmenter.RE_JUSTIFICATIVA class attribute.

Returns

preprocessed_text (str) – Content from text after the preprocessing steps.
justificativa_block (t.List[str]) – Detected legal text justificativa blocks. Only returned if return_justificativa=True.

generate_segments_from_ids(self, input_ids: Union[Sequence[int], numpy.typing.NDArray[numpy.int64]], label_ids: Union[Sequence[int], numpy.typing.NDArray[numpy.int64]], apply_postprocessing: bool = True) → List[str]

Generate segments from ids and labels.

Parameters

input_ids (t.Sequence[int] or npt.NDArray[np.int64]) – Tokenized text from model’s tokenizer.
label_ids (t.Sequence[int] or npt.NDArray[np.int64]) – Label ids for each token, where ‘label_id=1’ denotes the start of a new segment.
apply_postprocessing (bool, default=True) – If True, remove spurious whitespaces next to punctuation marks in the output.

Returns

segments – List containing all segments in textual form.

Return type

t.List[str]

segment_legal_text(self, text: Union[str, Dict[str, List[int]]], batch_size: int = 32, moving_window_size: int = 512, window_shift_size: Union[float, int] = 0.25, return_justificativa: bool = False, return_labels: bool = False, return_logits: bool = False, remove_noise_subsegments: bool = False, maximum_noise_subsegment_length: int = 25, apply_postprocessing: bool = True, show_progress_bar: bool = False, regex_justificativa: Optional[Union[str, regex.Pattern]] = None) → Union[List[str], Tuple[List[Any], Ellipsis]]

Segment legal text.

The pretrained model supports texts up to 1024 subwords. Texts larger than this value are pre-segmented into 1024 subword blocks, and each block is fed to the segmenter individually.

The block size can be configured to smaller (not larger) values using the moving_window_size argument of the BERTSegmenter.segment_legal_text method during inference.

Parameters

text (str or t.Dict[str, t.List[int]]) – Legal text to be segmented.
batch_size (int, default=32) – Maximum number of document blocks fed to the model in parallel. Higher values lead to faster inference at a higher memory cost.
moving_window_size (int, default=512) – Moving window size, the maximum number of subwords fed simultaneously to the segmenter model. Higher values lead to larger contexts for each token, at the expense of higher memory usage.
window_shift_size (int or float, default=0.25) –
Moving window shift size.
- If integer, specify the shift size per step exactly, and it must be in the [1, 1024] range.
- If float, the shift size is calculated as window_shift_size * moving_window_size (rounded up), and it must be in the (0.0, 1.0] range.
Overlapping logits are combined using the strategy specified by the argument inference_pooling_operation in Segmenter model initialization.

The final prediction for each token is derived from the combined logits.
return_justificativa (bool, default=False) – If True, return contents from the ‘justificativa’ block from document.
return_labels (bool, default=False) – If True, return label list for each token.
return_logits (bool, default=False) – If True, return logit array for each token.
remove_noise_subsegments (bool, default=False) –
If True, remove all tokens between tokens classified as noise_start (inclusive) and noise_end or segment (either exclusive), whichever occurs first.
- Tokens classified as noise_end are kept. In other words, they are the first non-noise token past the previous noise subsegment.
- Tokens between noise_start and the sentence end are also removed.
- Tokens between the sentence end and noise_end are kept.
- Only the closest noise_start for every noise_end (or the sentence end) is considered. In other words, redundant noise_start tokens are ignored.
maximum_noise_subsegment_length (int, default=25) – Maximum length (in tokens) allowed for each noise subsegment in order to be removed. Larger noise subsegments are kept intact. This argument is useful to prevent removing larger chunks of text that might actually contain useful information.
apply_postprocessing (bool, default=True) – If True, remove spurious whitespaces next to punctuation marks in the output.
show_progress_bar (bool, default=False) – If True, show segmentation progress bar.
regex_justificativa (str, regex.Pattern or None, default=None) – Regular expression specifying how the justificativa portion from legal documents should be detected. If None, will use the pattern predefined in Segmenter.RE_JUSTIFICATIVA class attribute.

Returns

segments (t.List[str]) – Segmented legal text.
justificativa (t.List[str]) – Detected legal text justificativa blocks. Only returned if return_justificativa=True.
labels (npt.NDArray[np.int32] of shape (N,)) – Predicted labels for each token, where N is the length of the tokenized document (in subword units). The -100 label is a special label, ignored while computing the loss function during training. Only returned if return_labels=True.
logits (npt.NDArray[np.float64] of shape (N, C)) – Predicted logits for each token, where N is the length of tokenized document (in subword units), and C is equal to the Segmenter.NUM_CLASSES attribute. Only returned if return_logits=True.

class segmentador.optimize.models.TorchJITBERTSegmenter(uri_model: str, uri_tokenizer: Optional[str] = None, inference_pooling_operation: str = 'sum', local_files_only: bool = True, cache_dir_model: str = './cache/models', cache_dir_tokenizer: str = './cache/tokenizers', uri_model_extension: str = '.pt')

Bases: _TorchJITBaseSegmenter

BERT segmenter in Torch JIT format.

Uses a pretrained Transformer Encoder to segment Brazilian Portuguese legal texts. The pretrained models support texts up to 1024 subwords. Texts larger than this value are pre-segmented into 1024 subword blocks, and each block is fed to the segmenter individually.

Parameters

uri_model (str) – URI to load pretrained model from. If local_files_only=True, then it must be a local file.
uri_tokenizer (str or None, default=None) – URI to pretrained text Tokenizer. If None, will assume that the tokenizer was serialized alongside the JIT model.
inference_pooling_operation ({'max', 'sum', 'gaussian', 'assymetric-max'}, default='sum') –
Specify the strategy used to combine logits during model inference for documents larger than 1024 subword tokens. Larger documents are sharded into possibly overlapping windows of 1024 subwords each. Thus, a single token may have multiple logits (and, therefore, predictions) associated with it. This argument defines exactly how the logits should be combined in order to derive the final verdict for that token. The possible choices for this argument are:
- max: take the maximum logit of each token;
- sum: sum the logits associated with the same token;
- gaussian: build a Gaussian filter that gives higher weight to logits closer to the window center and lower weight to logits closer to the window limits; and
- assymetric-max: take the maximum logit of each token for all classes other than the No-operation class, which in turn receives the minimum among all corresponding logits instead.
local_files_only (bool, default=True) – If True, will search only for local pretrained model and tokenizers. If False, may download models from Huggingface HUB, if necessary.
cache_dir_model (str, default='./cache/models') – Cache directory for transformer encoder model.
cache_dir_tokenizer (str, default='./cache/tokenizers') – Cache directory for text tokenizer.
uri_model_extension (str, default='.pt') – Expected file extension of the local model file. If uri_model does not end with the provided extension, it will be appended to the end of the URI before loading the model.
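For the gaussian pooling option, the idea of weighting positions by their distance to the window center can be sketched as below. The function name `gaussian_window_weights` and the `sigma_fraction` parameter (standard deviation as a fraction of the window size) are assumptions of this example; the width the library actually uses is not stated in this reference.

```python
import math
from typing import List


def gaussian_window_weights(window_size: int, sigma_fraction: float = 0.25) -> List[float]:
    """Weight 1.0 at the window center, decaying toward the window limits."""
    center = (window_size - 1) / 2
    sigma = sigma_fraction * window_size  # assumed width, for illustration
    return [
        math.exp(-0.5 * ((position - center) / sigma) ** 2)
        for position in range(window_size)
    ]
```

Under this scheme, each window's logits would be multiplied by the weight at their position before being summed with the logits of overlapping windows, so predictions made with more surrounding context count for more.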

NUM_CLASSES = 4

__call__(self, *args: Any, **kwargs: Any) → Union[List[str], Tuple[List[Any], Ellipsis]]

__repr__(self) → str: Return repr(self).

eval(self) → BaseSegmenter: Set model to evaluation mode.

train(self) → BaseSegmenter: Set model to train mode.

to(self, device: Union[str, torch.device]) → BaseSegmenter: Move underlying model to device.

property model(self) → Union[torch.nn.Module, transformers.BertForTokenClassification]

property tokenizer(self) → transformers.BertTokenizerFast

property RE_JUSTIFICATIVA(self) → regex.Pattern: Regular expression used to detect ‘justificativa’ blocks.

classmethod preprocess_legal_text(cls, text: str, return_justificativa: bool = False, regex_justificativa: Optional[Union[str, regex.Pattern]] = None) → Union[str, Tuple[str, List[str]]]

Apply minimal legal text preprocessing.

The preprocessing steps are: 1. Coalesce all blank spaces in text; 2. Remove all trailing and leading blank spaces; and 3. Pre-segment text into legal text content and justificativa.

Parameters

text (str) – Text to be preprocessed.
return_justificativa (bool, default=False) – If True, return a tuple in the format (content, justificativa). If False, return only content.
regex_justificativa (str, regex.Pattern or None, default=None) – Regular expression specifying how the justificativa portion from legal documents should be detected. If None, will use the pattern predefined in Segmenter.RE_JUSTIFICATIVA class attribute.

Returns

preprocessed_text (str) – Content from text after the preprocessing steps.
justificativa_block (t.List[str]) – Detected legal text justificativa blocks. Only returned if return_justificativa=True.

generate_segments_from_ids(self, input_ids: Union[Sequence[int], numpy.typing.NDArray[numpy.int64]], label_ids: Union[Sequence[int], numpy.typing.NDArray[numpy.int64]], apply_postprocessing: bool = True) → List[str]

Generate segments from ids and labels.

Parameters

input_ids (t.Sequence[int] or npt.NDArray[np.int64]) – Tokenized text from model’s tokenizer.
label_ids (t.Sequence[int] or npt.NDArray[np.int64]) – Label ids for each token, where ‘label_id=1’ denotes the start of a new segment.
apply_postprocessing (bool, default=True) – If True, remove spurious whitespaces next to punctuation marks in the output.

Returns

segments – List containing all segments in textual form.

Return type

t.List[str]

segment_legal_text(self, text: Union[str, Dict[str, List[int]]], batch_size: int = 32, moving_window_size: int = 512, window_shift_size: Union[float, int] = 0.25, return_justificativa: bool = False, return_labels: bool = False, return_logits: bool = False, remove_noise_subsegments: bool = False, maximum_noise_subsegment_length: int = 25, apply_postprocessing: bool = True, show_progress_bar: bool = False, regex_justificativa: Optional[Union[str, regex.Pattern]] = None) → Union[List[str], Tuple[List[Any], Ellipsis]]

Segment legal text.

The pretrained model supports texts up to 1024 subwords. Texts larger than this value are pre-segmented into 1024 subword blocks, and each block is fed to the segmenter individually.

The block size can be configured to smaller (not larger) values using the moving_window_size argument of the BERTSegmenter.segment_legal_text method during inference.

Parameters

text (str or t.Dict[str, t.List[int]]) – Legal text to be segmented.
batch_size (int, default=32) – Maximum number of document blocks fed to the model in parallel. Higher values lead to faster inference at a higher memory cost.
moving_window_size (int, default=512) – Moving window size, the maximum number of subwords fed simultaneously to the segmenter model. Higher values lead to larger contexts for each token, at the expense of higher memory usage.
window_shift_size (int or float, default=0.25) –
Moving window shift size.
- If integer, specify the shift size per step exactly, and it must be in the [1, 1024] range.
- If float, the shift size is calculated as window_shift_size * moving_window_size (rounded up), and it must be in the (0.0, 1.0] range.
Overlapping logits are combined using the strategy specified by the argument inference_pooling_operation in Segmenter model initialization.

The final prediction for each token is derived from the combined logits.
return_justificativa (bool, default=False) – If True, return contents from the ‘justificativa’ block from document.
return_labels (bool, default=False) – If True, return label list for each token.
return_logits (bool, default=False) – If True, return logit array for each token.
remove_noise_subsegments (bool, default=False) –
If True, remove all tokens between tokens classified as noise_start (inclusive) and noise_end or segment (either exclusive), whichever occurs first.
- Tokens classified as noise_end are kept. In other words, they are the first non-noise token past the previous noise subsegment.
- Tokens between noise_start and the sentence end are also removed.
- Tokens between the sentence end and noise_end are kept.
- Only the closest noise_start for every noise_end (or the sentence end) is considered. In other words, redundant noise_start tokens are ignored.
maximum_noise_subsegment_length (int, default=25) – Maximum length (in tokens) allowed for each noise subsegment in order to be removed. Larger noise subsegments are kept intact. This argument is useful to prevent removing larger chunks of text that might actually contain useful information.
apply_postprocessing (bool, default=True) – If True, remove spurious whitespaces next to punctuation marks in the output.
show_progress_bar (bool, default=False) – If True, show segmentation progress bar.
regex_justificativa (str, regex.Pattern or None, default=None) – Regular expression specifying how the justificativa portion from legal documents should be detected. If None, will use the pattern predefined in Segmenter.RE_JUSTIFICATIVA class attribute.

Returns

segments (t.List[str]) – Segmented legal text.
justificativa (t.List[str]) – Detected legal text justificativa blocks. Only returned if return_justificativa=True.
labels (npt.NDArray[np.int32] of shape (N,)) – Predicted labels for each token, where N is the length of the tokenized document (in subword units). The -100 label is a special label, ignored while computing the loss function during training. Only returned if return_labels=True.
logits (npt.NDArray[np.float64] of shape (N, C)) – Predicted logits for each token, where N is the length of tokenized document (in subword units), and C is equal to the Segmenter.NUM_CLASSES attribute. Only returned if return_logits=True.

class segmentador.optimize.models.TorchJITLSTMSegmenter(uri_model: str, uri_tokenizer: Optional[str] = None, inference_pooling_operation: str = 'sum', local_files_only: bool = True, cache_dir_model: str = './cache/models', cache_dir_tokenizer: str = './cache/tokenizers', uri_model_extension: str = '.pt')

Bases: _TorchJITBaseSegmenter

LSTM segmenter in Torch JIT format.

Parameters

uri_model (str) – URI to load pretrained model from. If local_files_only=True, then it must be a local file.
uri_tokenizer (str or None, default=None) – URI to pretrained text Tokenizer. If None, will assume that the tokenizer was serialized alongside the JIT model.
inference_pooling_operation ({'max', 'sum', 'gaussian', 'assymetric-max'}, default='sum') –
Specify the strategy used to combine logits during model inference for documents larger than 1024 subword tokens. Larger documents are sharded into possibly overlapping windows of 1024 subwords each. Thus, a single token may have multiple logits (and, therefore, predictions) associated with it. This argument defines exactly how the logits should be combined in order to derive the final verdict for that token. The possible choices for this argument are:
- max: take the maximum logit of each token;
- sum: sum the logits associated with the same token;
- gaussian: build a Gaussian filter that gives higher weight to logits closer to the window center and lower weight to logits closer to the window limits; and
- assymetric-max: take the maximum logit of each token for all classes other than the No-operation class, which in turn receives the minimum among all corresponding logits instead.
local_files_only (bool, default=True) – If True, will search only for local pretrained model and tokenizers. If False, may download models from Huggingface HUB, if necessary.
cache_dir_model (str, default='./cache/models') – Cache directory for transformer encoder model.
cache_dir_tokenizer (str, default='./cache/tokenizers') – Cache directory for text tokenizer.
uri_model_extension (str, default='.pt') – Expected file extension of the local model file. If uri_model does not end with the provided extension, it will be appended to the end of the URI before loading the model.

NUM_CLASSES = 4

__call__(self, *args: Any, **kwargs: Any) → Union[List[str], Tuple[List[Any], Ellipsis]]

__repr__(self) → str: Return repr(self).

eval(self) → BaseSegmenter: Set model to evaluation mode.

train(self) → BaseSegmenter: Set model to train mode.

to(self, device: Union[str, torch.device]) → BaseSegmenter: Move underlying model to device.

property model(self) → Union[torch.nn.Module, transformers.BertForTokenClassification]

property tokenizer(self) → transformers.BertTokenizerFast

property RE_JUSTIFICATIVA(self) → regex.Pattern: Regular expression used to detect ‘justificativa’ blocks.

classmethod preprocess_legal_text(cls, text: str, return_justificativa: bool = False, regex_justificativa: Optional[Union[str, regex.Pattern]] = None) → Union[str, Tuple[str, List[str]]]

Apply minimal legal text preprocessing.

The preprocessing steps are: 1. Coalesce all blank spaces in text; 2. Remove all trailing and leading blank spaces; and 3. Pre-segment text into legal text content and justificativa.

Parameters

text (str) – Text to be preprocessed.
return_justificativa (bool, default=False) – If True, return a tuple in the format (content, justificativa). If False, return only content.
regex_justificativa (str, regex.Pattern or None, default=None) – Regular expression specifying how the justificativa portion from legal documents should be detected. If None, will use the pattern predefined in Segmenter.RE_JUSTIFICATIVA class attribute.

Returns

preprocessed_text (str) – Content from text after the preprocessing steps.
justificativa_block (t.List[str]) – Detected legal text justificativa blocks. Only returned if return_justificativa=True.

generate_segments_from_ids(self, input_ids: Union[Sequence[int], numpy.typing.NDArray[numpy.int64]], label_ids: Union[Sequence[int], numpy.typing.NDArray[numpy.int64]], apply_postprocessing: bool = True) → List[str]

Generate segments from ids and labels.

Parameters

input_ids (t.Sequence[int] or npt.NDArray[np.int64]) – Tokenized text from model’s tokenizer.
label_ids (t.Sequence[int] or npt.NDArray[np.int64]) – Label ids for each token, where ‘label_id=1’ denotes the start of a new segment.
apply_postprocessing (bool, default=True) – If True, remove spurious whitespaces next to punctuation marks in the output.

Returns

segments – List containing all segments in textual form.

Return type

t.List[str]
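
The role of label_id=1 can be illustrated with a small sketch. It assumes the tokens have already been decoded to strings (the real method receives input_ids and relies on the model's tokenizer), and `split_by_labels` is a hypothetical helper, not part of the library.

```python
def split_by_labels(tokens, label_ids):
    """Group tokens into segments, opening a new segment at each label 1."""
    segments = []
    current = []
    for token, label in zip(tokens, label_ids):
        # A label of 1 marks the first token of a new segment.
        if label == 1 and current:
            segments.append(current)
            current = []
        current.append(token)
    if current:
        segments.append(current)
    # Joining with spaces is a simplification of the tokenizer's decoding.
    return [" ".join(segment) for segment in segments]


tokens = ["Art.", "1º", "Texto", "A", "Art.", "2º", "Texto", "B"]
labels = [0, 0, 0, 0, 1, 0, 0, 0]
segments = split_by_labels(tokens, labels)
```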

segment_legal_text(self, text: Union[str, Dict[str, List[int]]], batch_size: int = 32, moving_window_size: int = 512, window_shift_size: Union[float, int] = 0.25, return_justificativa: bool = False, return_labels: bool = False, return_logits: bool = False, remove_noise_subsegments: bool = False, maximum_noise_subsegment_length: int = 25, apply_postprocessing: bool = True, show_progress_bar: bool = False, regex_justificativa: Optional[Union[str, regex.Pattern]] = None) → Union[List[str], Tuple[List[Any], Ellipsis]]

Segment legal text.

The pretrained model supports texts up to 1024 subwords. Texts larger than this value are pre-segmented into 1024 subword blocks, and each block is fed to the segmenter individually.

The block size can be set to smaller (not larger) values during inference using the moving_window_size argument of the BERTSegmenter.segment_legal_text method.

Parameters

text (str or t.Dict[str, t.List[int]]) – Legal text to be segmented.
batch_size (int, default=32) – Maximum number of document blocks fed to the model in parallel. Higher values lead to faster inference at the expense of higher memory usage.
moving_window_size (int, default=512) – Moving window size, the maximum number of subwords fed simultaneously to the segmenter model. Higher values lead to larger contexts for each token, at the expense of higher memory usage.
window_shift_size (int or float, default=0.25) –
Moving window shift size.
- If integer, specify the shift size per step exactly, and it must be in [1, 1024] range.
- If float, the shift size is calculated as window_shift_size * moving_window_size (rounded up), and it must be in the (0.0, 1.0] range.
Overlapping logits are combined using the strategy specified by the argument inference_pooling_operation in Segmenter model initialization.

The final prediction for each token is derived from the combined logits.
return_justificativa (bool, default=False) – If True, return contents from the ‘justificativa’ block from document.
return_labels (bool, default=False) – If True, return label list for each token.
return_logits (bool, default=False) – If True, return logit array for each token.
remove_noise_subsegments (bool, default=False) –
If True, remove all tokens between tokens classified as noise_start (inclusive) and noise_end or segment (either exclusive), whichever occurs first.
- Tokens classified as noise_end are kept. In other words, they are the first non-noise token past the previous noise subsegment.
- Tokens between noise_start and the sentence end are also removed.
- Tokens between the sentence end and noise_end are kept.
- Only the closest noise_start for every noise_end (or the sentence end) is considered. In other words, redundant noise_start tokens are ignored.
maximum_noise_subsegment_length (int, default=25) – Maximum length (in tokens) allowed for each noise subsegment in order to be removed. Larger noise subsegments are kept intact. This argument is useful to prevent removing larger chunks of text that might actually contain useful information.
apply_postprocessing (bool, default=True) – If True, remove spurious whitespaces next to punctuation marks in the output.
show_progress_bar (bool, default=False) – If True, show segmentation progress bar.
regex_justificativa (str, regex.Pattern or None, default=None) – Regular expression specifying how the justificativa portion of legal documents should be detected. If None, the pattern predefined in the Segmenter.RE_JUSTIFICATIVA class attribute is used.
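
The window_shift_size rules described above can be sketched as follows. The helpers `resolve_shift` and `token_coverage` are hypothetical, and the way windows are laid out here (starting every shift tokens, with the last windows truncated at the document end) is an assumption made for illustration; the library may place windows differently.

```python
import math


def resolve_shift(window_shift_size, moving_window_size):
    """Turn window_shift_size into an integer shift, following the documented rules."""
    if isinstance(window_shift_size, float):
        if not 0.0 < window_shift_size <= 1.0:
            raise ValueError("float shift must be in the (0.0, 1.0] range")
        # Fraction of the window size, rounded up.
        return math.ceil(window_shift_size * moving_window_size)
    if not 1 <= window_shift_size <= 1024:
        raise ValueError("integer shift must be in the [1, 1024] range")
    return window_shift_size


def token_coverage(num_tokens, moving_window_size, shift):
    """Count how many windows cover each token (assumed window layout)."""
    coverage = [0] * num_tokens
    for start in range(0, num_tokens, shift):
        for index in range(start, min(start + moving_window_size, num_tokens)):
            coverage[index] += 1
    return coverage


default_shift = resolve_shift(0.25, 512)
coverage = token_coverage(num_tokens=10, moving_window_size=4, shift=2)
```

Tokens with a coverage above 1 receive several logits, which is where inference_pooling_operation comes into play.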

Returns

segments (t.List[str]) – Segmented legal text.
justificativa (t.List[str]) – Detected legal text justificativa blocks. Only returned if return_justificativa=True.
labels (npt.NDArray[np.int32] of shape (N,)) – Predicted labels for each token, where N is the length of the tokenized document (in subword units). The -100 label is a special label, ignored while computing the loss function during training. Only returned if return_labels=True.
logits (npt.NDArray[np.float64] of shape (N, C)) – Predicted logits for each token, where N is the length of tokenized document (in subword units), and C is equal to the Segmenter.NUM_CLASSES attribute. Only returned if return_logits=True.
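
The remove_noise_subsegments rules can be read as the sketch below. It is one possible interpretation, not the library's code: the label ids for noise_start and noise_end are hypothetical (only label 1 for segment starts is documented), "closest noise_start" is taken to mean the most recent one before the closing token, and `remove_noise` is an illustrative helper.

```python
SEGMENT = 1       # documented: start of a new segment
NOISE_START = 2   # hypothetical label id
NOISE_END = 3     # hypothetical label id


def remove_noise(tokens, label_ids, maximum_noise_subsegment_length=25):
    """Drop tokens from noise_start (inclusive) to the closing token (exclusive)."""
    keep = [True] * len(tokens)
    start = None

    def drop(begin, end):
        # Subsegments longer than the limit are kept intact.
        if end - begin <= maximum_noise_subsegment_length:
            keep[begin:end] = [False] * (end - begin)

    for index, label in enumerate(label_ids):
        if label == NOISE_START:
            # A later noise_start replaces an earlier, still open one.
            start = index
        elif label in (NOISE_END, SEGMENT) and start is not None:
            drop(start, index)
            start = None
    if start is not None:
        # No closing token: remove up to the sentence end.
        drop(start, len(tokens))
    return [token for token, kept in zip(tokens, keep) if kept]


tokens = ["Art.", "1º", "(", "NR", ")", "Texto", "fim"]
labels = [1, 0, 2, 0, 0, 3, 0]
cleaned = remove_noise(tokens, labels)
```

A noise_end without a preceding noise_start removes nothing in this sketch, which matches the rule that tokens between the sentence end and noise_end are kept.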
