segmentador.segmenter
Legal text segmenter.
Module Contents
Classes
BERTSegmenter – BERT segmenter model for Brazilian legislative bills.
LSTMSegmenter – Bi-LSTM segmenter model for Brazilian legislative bills.
Attributes
- class segmentador.segmenter.BERTSegmenter(uri_model: str = '4_layer_6000_vocab_size_bert_v3', uri_tokenizer: Optional[str] = None, inference_pooling_operation: str = 'sum', local_files_only: bool = False, device: str = 'cpu', init_from_pretrained_weights: bool = True, config: Optional[Union[transformers.BertConfig, transformers.PretrainedConfig]] = None, num_labels: int = 4, num_hidden_layers: int = 6, cache_dir_model: str = './cache/models', cache_dir_tokenizer: str = './cache/tokenizers', uri_model_extension: str = '', show_download_progress_bar: bool = True)
Bases: segmentador._base.BaseSegmenter
BERT segmenter model for Brazilian legislative bills.
- Parameters
uri_model (str, default='4_layer_6000_vocab_size_bert_v3') – URI to load pretrained model from. May be a valid pretrained Ulysses segmenter model, a Huggingface HUB URL, or a local file (mandatory when local_files_only=True). See [1] for more information about pretrained Ulysses segmenter models.
uri_tokenizer (str or None, default=None) – URI to pretrained text Tokenizer. If None, will load the tokenizer from the uri_model path.
inference_pooling_operation ({'max', 'sum', 'gaussian', 'assymetric-max'}, default='sum') –
Specify the strategy used to combine logits during model inference for documents larger than 1024 subword tokens. Larger documents are sharded into possibly overlapping windows of 1024 subwords each. Thus, a single token may have multiple logits (and, therefore, predictions) associated with it. This argument defines how the logits are combined to derive the final prediction for that token. The possible choices for this argument are:
max: take the maximum logit of each token;
sum: sum the logits associated with the same token;
gaussian: apply a Gaussian filter that gives more weight to logits closer to the window center and less weight to logits closer to the window limits; and
assymetric-max: take the maximum logit of each token for all classes other than the No-operation class, which in turn receives the minimum among all corresponding logits instead.
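The pooling choices above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the function names, the index of the No-operation class (`NO_OP = 0`), and the Gaussian width (`sigma_scale`) are hypothetical, and the `gaussian` branch assumes the weighted logits are summed.

```python
import math

NO_OP = 0  # assumption: index of the "No-operation" class


def gaussian_weights(window_size, sigma_scale=0.25):
    """Weights that peak at the window center (hypothetical parametrization)."""
    center = (window_size - 1) / 2
    sigma = max(window_size * sigma_scale, 1e-9)
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(window_size)]


def pool_logits(window_logits, starts, num_tokens, num_classes, op="sum"):
    """Combine per-window logits into a single logit vector per token.

    window_logits[w][i][c] is the logit of class c for the i-th token of
    window w; starts[w] is the document position of that window's first token.
    Every token is assumed to be covered by at least one window.
    """
    # Gather (logit vector, position in window, window size) for each token.
    collected = [[] for _ in range(num_tokens)]
    for logits, start in zip(window_logits, starts):
        for i, vec in enumerate(logits):
            collected[start + i].append((vec, i, len(logits)))

    pooled = []
    for items in collected:
        out = []
        for c in range(num_classes):
            values = [vec[c] for vec, _, _ in items]
            if op == "sum":
                out.append(sum(values))
            elif op == "max":
                out.append(max(values))
            elif op == "assymetric-max":
                # Minimum for the No-operation class, maximum for the others.
                out.append(min(values) if c == NO_OP else max(values))
            elif op == "gaussian":
                # Assumption: weighted logits are summed after weighting.
                out.append(sum(vec[c] * gaussian_weights(n)[i] for vec, i, n in items))
            else:
                raise ValueError(f"unknown pooling operation: {op}")
        pooled.append(out)
    return pooled


# Two windows of two tokens each, overlapping on the middle token.
windows = [[[1.0, 0.0], [0.5, 2.0]], [[1.5, 1.0], [0.0, 3.0]]]
pooled_sum = pool_logits(windows, starts=[0, 1], num_tokens=3, num_classes=2, op="sum")
```

Only the middle token receives two logit vectors here, so it is the only one whose pooled value depends on the chosen operation.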
local_files_only (bool, default=False) – If True, will search only for local pretrained model and tokenizers. If False, may download pretrained Ulysses models or models from Huggingface HUB, when necessary.
device ({'cpu', 'cuda'}, default='cpu') – Device to segment document content.
init_from_pretrained_weights (bool, default=True) – If True, load pretrained weights from the specified uri_model argument. If False, load only the model configuration from the same argument.
config (transformers.BertConfig or None, default=None) –
Custom model configuration. Used only if init_from_pretrained_weights=False. If init_from_pretrained_weights=False and config=None, will load the configuration file from uri_model with the following changes:
config.max_position_embeddings = 1024
config.num_hidden_layers = num_hidden_layers
config.num_labels = num_labels
num_labels (int, default=4) – Number of labels in the configuration file.
num_hidden_layers (int, default=6) – Maximum number of Transformer encoder hidden layers. If the model has more hidden layers than the value specified in this parameter, the later hidden layers will be removed.
cache_dir_model (str, default='./cache/models') – Cache directory for transformer encoder model.
cache_dir_tokenizer (str, default='./cache/tokenizers') – Cache directory for text tokenizer.
uri_model_extension (str, default='') – Expected file extension of the local model file. If uri_model does not end with the provided extension, it is appended to the end of the URI before loading the model.
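The extension rule described for uri_model_extension amounts to a suffix check. A minimal sketch, assuming a hypothetical helper name (`resolve_model_uri` is not part of the library):

```python
def resolve_model_uri(uri_model: str, uri_model_extension: str = "") -> str:
    """Append the expected extension unless the URI already ends with it."""
    if uri_model.endswith(uri_model_extension):
        return uri_model
    return uri_model + uri_model_extension


resolved = resolve_model_uri("my_local_model", ".pt")
```

With the default empty extension the URI is returned unchanged, since every string ends with the empty string.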
show_download_progress_bar (bool, default=True) – If True, show download progress bar for pretrained Ulysses models. Note that progress bars related to Huggingface HUB may still be shown regardless of this parameter.
References
- 1
About pretrained models in Ulysses Segmenter documentation, at GitHub (2022). URL: https://github.com/ulysses-camara/ulysses-segmenter#trained-models
- NUM_CLASSES = 4
- __call__(self, *args: Any, **kwargs: Any) Union[List[str], Tuple[List[Any], Ellipsis]]
- __repr__(self) str
Return repr(self).
- eval(self) BaseSegmenter
Set model to evaluation mode.
- train(self) BaseSegmenter
Set model to train mode.
- to(self, device: Union[str, torch.device]) BaseSegmenter
Move underlying model to device.
- property model(self) Union[torch.nn.Module, transformers.BertForTokenClassification]
- property tokenizer(self) transformers.BertTokenizerFast
- property RE_JUSTIFICATIVA(self) regex.Pattern
Regular expression used to detect ‘justificativa’ blocks.
- classmethod preprocess_legal_text(cls, text: str, return_justificativa: bool = False, regex_justificativa: Optional[Union[str, regex.Pattern]] = None) Union[str, Tuple[str, List[str]]]
Apply minimal legal text preprocessing.
The preprocessing steps are:
1. Coalesce all blank spaces in text;
2. Remove all trailing and leading blank spaces; and
3. Pre-segment text into legal text content and justificativa.
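The three preprocessing steps can be sketched with the standard library. This is an illustration only: the pattern below is a hypothetical stand-in for RE_JUSTIFICATIVA (whose actual value is not shown in this page), `preprocess_sketch` is not a library function, and the stdlib `re` module is used in place of the `regex` package.

```python
import re

# Hypothetical pattern; the library's actual RE_JUSTIFICATIVA may differ.
RE_JUSTIFICATIVA_SKETCH = re.compile(r"\bJUSTIFICATIVA\b")


def preprocess_sketch(text, return_justificativa=False, regex_justificativa=None):
    pattern = RE_JUSTIFICATIVA_SKETCH if regex_justificativa is None else regex_justificativa
    if isinstance(pattern, str):
        pattern = re.compile(pattern)
    # Steps 1 and 2: coalesce whitespace runs, then strip both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Step 3: split the legal content from the justificativa block(s).
    content, *justificativa = pattern.split(text)
    content = content.strip()
    justificativa = [block.strip() for block in justificativa]
    return (content, justificativa) if return_justificativa else content


raw = "  Art. 1º  Esta lei \n entra em vigor.   JUSTIFICATIVA   O projeto é necessário.  "
content, blocks = preprocess_sketch(raw, return_justificativa=True)
```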
- Parameters
text (str) – Text to be preprocessed.
return_justificativa (bool, default=False) – If True, return a tuple in the format (content, justificativa). If False, return only content.
regex_justificativa (str, regex.Pattern or None, default=None) – Regular expression specifying how the justificativa portion from legal documents should be detected. If None, will use the pattern predefined in Segmenter.RE_JUSTIFICATIVA class attribute.
- Returns
preprocessed_text (str) – Content from text after the preprocessing steps.
justificativa_block (t.List[str]) – Detected legal text justificativa blocks. Only returned if return_justificativa=True.
- generate_segments_from_ids(self, input_ids: Union[Sequence[int], numpy.typing.NDArray[numpy.int64]], label_ids: Union[Sequence[int], numpy.typing.NDArray[numpy.int64]], apply_postprocessing: bool = True) List[str]
Generate segments from ids and labels.
- Parameters
input_ids (t.Sequence[int] or npt.NDArray[np.int64]) – Tokenized text from model’s tokenizer.
label_ids (t.Sequence[int] or npt.NDArray[np.int64]) – Label ids for each token, where ‘label_id=1’ denotes the start of a new segment.
apply_postprocessing (bool, default=True) – If True, remove spurious whitespaces next to punctuation marks in the output.
- Returns
segments – List containing all segments in textual form.
- Return type
t.List[str]
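The grouping logic behind generate_segments_from_ids can be sketched as follows. To stay self-contained, this hypothetical helper works on already-decoded tokens instead of input_ids (the actual method decodes ids with the model's tokenizer), and the punctuation rule used for postprocessing is an assumption.

```python
import re


def segments_from_labels(tokens, label_ids, apply_postprocessing=True, segment_label=1):
    """Group tokens into segments; a token labeled 1 starts a new segment."""
    segments, current = [], []
    for token, label in zip(tokens, label_ids):
        if label == segment_label and current:
            segments.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        segments.append(" ".join(current))
    if apply_postprocessing:
        # Assumed rule: drop whitespace placed right before punctuation marks.
        segments = [re.sub(r"\s+([,.;:!?])", r"\1", seg) for seg in segments]
    return segments


tokens = ["Art", ".", "1º", "Fica", "criado", ".", "Art", ".", "2º", "Esta", "lei", "."]
labels = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
segments = segments_from_labels(tokens, labels)
```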
- segment_legal_text(self, text: Union[str, Dict[str, List[int]]], batch_size: int = 32, moving_window_size: int = 512, window_shift_size: Union[float, int] = 0.25, return_justificativa: bool = False, return_labels: bool = False, return_logits: bool = False, remove_noise_subsegments: bool = False, maximum_noise_subsegment_length: int = 25, apply_postprocessing: bool = True, show_progress_bar: bool = False, regex_justificativa: Optional[Union[str, regex.Pattern]] = None) Union[List[str], Tuple[List[Any], Ellipsis]]
Segment legal text.
The pretrained model supports texts of up to 1024 subwords. Texts larger than this value are pre-segmented into blocks of 1024 subwords, and each block is fed to the segmenter individually.
The block size can be set to smaller (not larger) values using the moving_window_size argument of the BERTSegmenter.segment_legal_text method during inference.
- Parameters
text (str or t.Dict[str, t.List[int]]) – Legal text to be segmented.
batch_size (int, default=32) – Maximum number of document blocks fed to the model in parallel. Higher values lead to faster inference at a higher memory cost.
moving_window_size (int, default=512) – Moving window size, the maximum number of subwords fed simultaneously to the segmenter model. Higher values lead to larger contexts for each token, at the expense of higher memory usage.
window_shift_size (int or float, default=0.25) –
Moving window shift size.
If integer, it specifies the exact shift size per step, and it must be in the [1, 1024] range.
If float, the shift size is calculated as window_shift_size * moving_window_size (rounded up), and it must be in the (0.0, 1.0] range.
Overlapping logits are combined using the strategy specified by the argument inference_pooling_operation in Segmenter model initialization.
The final prediction for each token is derived from the combined logits.
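The shift-size rule above can be sketched as a small helper that resolves window_shift_size and lists the window start positions. The helper name and the stopping rule (stop once a window reaches the end of the document) are assumptions made for illustration; the library may handle the last window differently.

```python
import math


def window_starts(num_tokens, moving_window_size=512, window_shift_size=0.25):
    """Return (shift, starts) for a moving window over num_tokens subwords."""
    if isinstance(window_shift_size, float):
        if not 0.0 < window_shift_size <= 1.0:
            raise ValueError("float window_shift_size must be in (0.0, 1.0]")
        # Fraction of the window size, rounded up.
        shift = math.ceil(window_shift_size * moving_window_size)
    else:
        if not 1 <= window_shift_size <= 1024:
            raise ValueError("integer window_shift_size must be in [1, 1024]")
        shift = window_shift_size

    starts, start = [], 0
    while True:
        starts.append(start)
        # Assumption: stop as soon as a window covers the document end.
        if start + moving_window_size >= num_tokens:
            break
        start += shift
    return shift, starts


shift, starts = window_starts(num_tokens=1200)
```

With the defaults, consecutive windows overlap by 384 subwords, so most tokens receive several logit vectors that are then pooled.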
return_justificativa (bool, default=False) – If True, return contents from the ‘justificativa’ block from document.
return_labels (bool, default=False) – If True, return label list for each token.
return_logits (bool, default=False) – If True, return logit array for each token.
remove_noise_subsegments (bool, default=False) –
If True, remove all tokens between tokens classified as noise_start (inclusive) and noise_end or segment (either exclusive), whichever occurs first.
Tokens classified as noise_end are kept. In other words, they are the first non-noise token past the previous noise subsegment.
Tokens between noise_start and the sentence end are also removed.
Tokens between the sentence end and noise_end are kept.
Only the closest noise_start for every noise_end (or the sentence end) is considered. In other words, redundant noise_start tokens are ignored.
maximum_noise_subsegment_length (int, default=25) – Maximum length (in tokens) allowed for a noise subsegment to be removed. Larger noise subsegments are kept intact. This argument is useful to prevent removing larger chunks of text that might actually contain useful information.
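One possible reading of the noise removal rules, including the length limit, is sketched below. The label ids (`SEGMENT = 1`, `NOISE_START = 2`, `NOISE_END = 3`), the helper name, and the interpretation of "closest noise_start" (a later noise_start supersedes an earlier pending one) are assumptions, not confirmed library behavior.

```python
SEGMENT, NOISE_START, NOISE_END = 1, 2, 3  # assumed label ids


def remove_noise_sketch(tokens, labels, maximum_noise_subsegment_length=25):
    """Drop tokens from noise_start (inclusive) up to noise_end/segment (exclusive)."""
    keep = [True] * len(tokens)
    start = None  # index of the closest pending noise_start, if any

    def drop(begin, end):
        # Only short enough noise subsegments are removed.
        if end - begin <= maximum_noise_subsegment_length:
            for j in range(begin, end):
                keep[j] = False

    for i, label in enumerate(labels):
        if label == NOISE_START:
            start = i  # a later noise_start replaces the pending one
        elif label in (NOISE_END, SEGMENT) and start is not None:
            drop(start, i)  # the closing token itself is kept
            start = None
    if start is not None:
        drop(start, len(tokens))  # noise running until the end of the text
    return [tok for tok, kept in zip(tokens, keep) if kept]


tokens = ["Art.", "1º", "Página", "2", "Fica", "criado", "Art.", "2º", "Rodapé"]
labels = [1, 0, 2, 0, 3, 0, 1, 0, 2]
cleaned = remove_noise_sketch(tokens, labels)
```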
apply_postprocessing (bool, default=True) – If True, remove spurious whitespaces next to punctuation marks in the output.
show_progress_bar (bool, default=False) – If True, show segmentation progress bar.
regex_justificativa (str, regex.Pattern or None, default=None) – Regular expression specifying how the justificativa portion from legal documents should be detected. If None, will use the pattern predefined in Segmenter.RE_JUSTIFICATIVA class attribute.
- Returns
segments (t.List[str]) – Segmented legal text.
justificativa (t.List[str]) – Detected legal text justificativa blocks. Only returned if return_justificativa=True.
labels (npt.NDArray[np.int32] of shape (N,)) – Predicted labels for each token, where N is the length of the tokenized document (in subword units). The -100 label is a special label, ignored while computing the loss function during training. Only returned if return_labels=True.
logits (npt.NDArray[np.float64] of shape (N, C)) – Predicted logits for each token, where N is the length of tokenized document (in subword units), and C is equal to the Segmenter.NUM_CLASSES attribute. Only returned if return_logits=True.
- segmentador.segmenter.Segmenter
- class segmentador.segmenter.LSTMSegmenter(uri_model: str = '256_hidden_dim_6000_vocab_size_1_layer_lstm_v3', uri_tokenizer: str = '6000_subword_tokenizer', inference_pooling_operation: str = 'sum', local_files_only: bool = False, device: str = 'cpu', from_quantized_weights: bool = False, lstm_hidden_layer_size: Optional[int] = None, lstm_num_layers: Optional[int] = None, cache_dir_model: str = './cache/models', cache_dir_tokenizer: str = './cache/tokenizers', uri_model_extension: str = '.pt', show_download_progress_bar: bool = True)
Bases: segmentador._base.BaseSegmenter
Bi-LSTM segmenter model for Brazilian legislative bills.
- Parameters
uri_model (str, default='256_hidden_dim_6000_vocab_size_1_layer_lstm_v3') – URI to load pretrained model from. May be a valid pretrained Ulysses segmenter model, or a local file (mandatory when local_files_only=True). See [1] for more information about pretrained Ulysses segmenter models.
uri_tokenizer (str, default='6000_subword_tokenizer') – URI to pretrained text Tokenizer.
inference_pooling_operation ({'max', 'sum', 'gaussian', 'assymetric-max'}, default='sum') –
Specify the strategy used to combine logits during model inference for documents larger than moving_window_size subword tokens (see LSTMSegmenter.segment_legal_text documentation). Larger documents are sharded into possibly overlapping windows of moving_window_size subwords each. Thus, a single token may have multiple logits (and, therefore, predictions) associated with it. This argument defines how the logits are combined to derive the final prediction for that token. The possible choices for this argument are:
max: take the maximum logit of each token;
sum: sum the logits associated with the same token;
gaussian: apply a Gaussian filter that gives more weight to logits closer to the window center and less weight to logits closer to the window limits; and
assymetric-max: take the maximum logit of each token for all classes other than the No-operation class, which in turn receives the minimum among all corresponding logits instead.
local_files_only (bool, default=False) – If True, will search only for local pretrained model and tokenizers. If False, may download pretrained Ulysses models or models from Huggingface HUB, when necessary.
device ({'cpu', 'cuda'}, default='cpu') – Device to segment document content.
from_quantized_weights (bool, default=False) – Set to True if the pretrained weights were previously quantized (from FP32 to UINT8), in Torch format. Note that this option is not meant to support the quantization strategies provided by optimize.quantize_model, but only other, external quantization strategies.
lstm_hidden_layer_size (int or None, default=None) – Dimension of LSTM model hidden layer.
lstm_num_layers (int or None, default=None) – Number of layers in LSTM model.
cache_dir_model (str, default='./cache/models') – Cache directory for LSTM model.
cache_dir_tokenizer (str, default='./cache/tokenizers') – Cache directory for text tokenizer.
uri_model_extension (str, default='.pt') – Expected file extension of the local model file. If uri_model does not end with the provided extension, it is appended to the end of the URI before loading the model.
show_download_progress_bar (bool, default=True) – If True, show download progress bar for pretrained Ulysses models. Note that progress bars related to Huggingface HUB may still be shown regardless of this parameter.
See also
optimize.quantize_model: create a quantized model from an existing Segmenter model.
References
- 1
About pretrained models in Ulysses Segmenter documentation, at GitHub (2022). URL: https://github.com/ulysses-camara/ulysses-segmenter#trained-models
- NUM_CLASSES = 4
- __call__(self, *args: Any, **kwargs: Any) Union[List[str], Tuple[List[Any], Ellipsis]]
- __repr__(self) str
Return repr(self).
- eval(self) BaseSegmenter
Set model to evaluation mode.
- train(self) BaseSegmenter
Set model to train mode.
- to(self, device: Union[str, torch.device]) BaseSegmenter
Move underlying model to device.
- property model(self) Union[torch.nn.Module, transformers.BertForTokenClassification]
- property tokenizer(self) transformers.BertTokenizerFast
- property RE_JUSTIFICATIVA(self) regex.Pattern
Regular expression used to detect ‘justificativa’ blocks.
- classmethod preprocess_legal_text(cls, text: str, return_justificativa: bool = False, regex_justificativa: Optional[Union[str, regex.Pattern]] = None) Union[str, Tuple[str, List[str]]]
Apply minimal legal text preprocessing.
The preprocessing steps are:
1. Coalesce all blank spaces in text;
2. Remove all trailing and leading blank spaces; and
3. Pre-segment text into legal text content and justificativa.
- Parameters
text (str) – Text to be preprocessed.
return_justificativa (bool, default=False) – If True, return a tuple in the format (content, justificativa). If False, return only content.
regex_justificativa (str, regex.Pattern or None, default=None) – Regular expression specifying how the justificativa portion from legal documents should be detected. If None, will use the pattern predefined in Segmenter.RE_JUSTIFICATIVA class attribute.
- Returns
preprocessed_text (str) – Content from text after the preprocessing steps.
justificativa_block (t.List[str]) – Detected legal text justificativa blocks. Only returned if return_justificativa=True.
- generate_segments_from_ids(self, input_ids: Union[Sequence[int], numpy.typing.NDArray[numpy.int64]], label_ids: Union[Sequence[int], numpy.typing.NDArray[numpy.int64]], apply_postprocessing: bool = True) List[str]
Generate segments from ids and labels.
- Parameters
input_ids (t.Sequence[int] or npt.NDArray[np.int64]) – Tokenized text from model’s tokenizer.
label_ids (t.Sequence[int] or npt.NDArray[np.int64]) – Label ids for each token, where ‘label_id=1’ denotes the start of a new segment.
apply_postprocessing (bool, default=True) – If True, remove spurious whitespaces next to punctuation marks in the output.
- Returns
segments – List containing all segments in textual form.
- Return type
t.List[str]
- segment_legal_text(self, text: Union[str, Dict[str, List[int]]], batch_size: int = 32, moving_window_size: int = 512, window_shift_size: Union[float, int] = 0.25, return_justificativa: bool = False, return_labels: bool = False, return_logits: bool = False, remove_noise_subsegments: bool = False, maximum_noise_subsegment_length: int = 25, apply_postprocessing: bool = True, show_progress_bar: bool = False, regex_justificativa: Optional[Union[str, regex.Pattern]] = None) Union[List[str], Tuple[List[Any], Ellipsis]]
Segment legal text.
The pretrained model supports texts of up to 1024 subwords. Texts larger than this value are pre-segmented into blocks of 1024 subwords, and each block is fed to the segmenter individually.
The block size can be set to smaller (not larger) values using the moving_window_size argument of the BERTSegmenter.segment_legal_text method during inference.
- Parameters
text (str or t.Dict[str, t.List[int]]) – Legal text to be segmented.
batch_size (int, default=32) – Maximum number of document blocks fed to the model in parallel. Higher values lead to faster inference at a higher memory cost.
moving_window_size (int, default=512) – Moving window size, the maximum number of subwords fed simultaneously to the segmenter model. Higher values lead to larger contexts for each token, at the expense of higher memory usage.
window_shift_size (int or float, default=0.25) –
Moving window shift size.
If integer, it specifies the exact shift size per step, and it must be in the [1, 1024] range.
If float, the shift size is calculated as window_shift_size * moving_window_size (rounded up), and it must be in the (0.0, 1.0] range.
Overlapping logits are combined using the strategy specified by the argument inference_pooling_operation in Segmenter model initialization.
The final prediction for each token is derived from the combined logits.
return_justificativa (bool, default=False) – If True, return contents from the ‘justificativa’ block from document.
return_labels (bool, default=False) – If True, return label list for each token.
return_logits (bool, default=False) – If True, return logit array for each token.
remove_noise_subsegments (bool, default=False) –
If True, remove all tokens between tokens classified as noise_start (inclusive) and noise_end or segment (either exclusive), whichever occurs first.
Tokens classified as noise_end are kept. In other words, they are the first non-noise token past the previous noise subsegment.
Tokens between noise_start and the sentence end are also removed.
Tokens between the sentence end and noise_end are kept.
Only the closest noise_start for every noise_end (or the sentence end) is considered. In other words, redundant noise_start tokens are ignored.
maximum_noise_subsegment_length (int, default=25) – Maximum length (in tokens) allowed for a noise subsegment to be removed. Larger noise subsegments are kept intact. This argument is useful to prevent removing larger chunks of text that might actually contain useful information.
apply_postprocessing (bool, default=True) – If True, remove spurious whitespaces next to punctuation marks in the output.
show_progress_bar (bool, default=False) – If True, show segmentation progress bar.
regex_justificativa (str, regex.Pattern or None, default=None) – Regular expression specifying how the justificativa portion from legal documents should be detected. If None, will use the pattern predefined in Segmenter.RE_JUSTIFICATIVA class attribute.
- Returns
segments (t.List[str]) – Segmented legal text.
justificativa (t.List[str]) – Detected legal text justificativa blocks. Only returned if return_justificativa=True.
labels (npt.NDArray[np.int32] of shape (N,)) – Predicted labels for each token, where N is the length of the tokenized document (in subword units). The -100 label is a special label, ignored while computing the loss function during training. Only returned if return_labels=True.
logits (npt.NDArray[np.float64] of shape (N, C)) – Predicted logits for each token, where N is the length of tokenized document (in subword units), and C is equal to the Segmenter.NUM_CLASSES attribute. Only returned if return_logits=True.