Version: 3.x


LexicalSyntacticFeaturizer Objects

@DefaultV1Recipe.register(
    DefaultV1Recipe.ComponentType.MESSAGE_FEATURIZER, is_trainable=True
)
class LexicalSyntacticFeaturizer(SparseFeaturizer, GraphComponent)

Extracts and encodes lexical syntactic features.

Given a sequence of tokens, this featurizer produces a sequence of features where the t-th feature encodes lexical and syntactic information about the t-th token and its surrounding tokens.

In detail: The lexical syntactic features can be specified via a list of configurations [c_0, c_1, ..., c_{n-1}] where each c_i is a list of names of lexical and syntactic features (e.g. low, suffix2, digit). For a given tokenized text, the featurizer will consider a window of size n around each token and evaluate the given list of configurations as follows:

  • It will extract the features listed in c_m from token t, where m = (n-1)/2 if n is odd and m = n/2 if n is even.
  • It will extract the features listed in c_{m-1}, c_{m-2}, ... from the last, second to last, ... token before token t, respectively.
  • It will extract the features listed in c_{m+1}, c_{m+2}, ... from the first, second, ... token after token t, respectively.

It will then combine all these features into one feature for position t.

Example: If we specify [['low'], ['upper'], ['prefix2']], then for each position t the t-th feature will encode whether the token at position t is upper case, whether the token at position t-1 is lower case, and the first two characters of the token at position t+1.
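The windowing described above can be sketched in plain Python. This is a simplified illustration, not the component's actual implementation: the `FEATURE_FUNCTIONS` table, the `window_features` helper, and the choice to key features by relative offset are all assumptions made for this example, as is the rule that a window with an even number of configurations reaches one token further to the left.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical feature functions keyed by name; only a small subset is shown.
FEATURE_FUNCTIONS: Dict[str, Callable[[str], object]] = {
    "low": lambda token: token.islower(),
    "upper": lambda token: token.isupper(),
    "title": lambda token: token.istitle(),
    "digit": lambda token: token.isdigit(),
    "prefix2": lambda token: token[:2],
    "suffix2": lambda token: token[-2:],
}


def window_features(
    tokens: List[str], configs: List[List[str]]
) -> List[Dict[Tuple[int, str], object]]:
    """Collects, for every position t, raw features of t and its neighbours."""
    n = len(configs)
    half = n // 2
    # Relative offsets covered by the window, e.g. [-1, 0, 1] for n = 3
    # and [-2, -1, 0, 1] for n = 4 (assumed handling of even sizes).
    offsets = range(-half, half + n % 2)
    result = []
    for t in range(len(tokens)):
        features: Dict[Tuple[int, str], object] = {}
        for config_index, offset in enumerate(offsets):
            position = t + offset
            if 0 <= position < len(tokens):  # skip neighbours outside the text
                for name in configs[config_index]:
                    value = FEATURE_FUNCTIONS[name](tokens[position])
                    features[(offset, name)] = value
        result.append(features)
    return result


tokens = ["Hello", "WORLD", "again"]
configs = [["low"], ["upper"], ["prefix2"]]
per_token = window_features(tokens, configs)
```

For the middle token, `per_token[1]` holds the lower-case flag of the previous token, the upper-case flag of the token itself, and the two-character prefix of the next token. Tokens at the edges of the text simply receive fewer entries.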


@classmethod
def required_components(cls) -> List[Type]

Components that should be included in the pipeline before this component.


@staticmethod
def get_default_config() -> Dict[Text, Any]

Returns the component's default config.


def __init__(
    config: Dict[Text, Any],
    model_storage: ModelStorage,
    resource: Resource,
    execution_context: ExecutionContext,
    feature_to_idx_dict: Optional[Dict[Tuple[int, Text], Dict[Text, int]]] = None
) -> None

Instantiates a new LexicalSyntacticFeaturizer instance.


@classmethod
def validate_config(cls, config: Dict[Text, Any]) -> None

Validates that the component is configured properly.


def train(training_data: TrainingData) -> Resource

Trains the featurizer.


  • training_data - the training data


Returns the resource from which this trained component can be loaded.
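The `feature_to_idx_dict` parameter of `__init__` has the type `Dict[Tuple[int, Text], Dict[Text, int]]`, which suggests that training builds a lookup from feature keys and their observed values to column indices. The sketch below shows one way such a mapping could be built. The helper name `build_feature_to_idx_dict`, the string-valued raw features, and the sorted index order are assumptions for illustration, not the component's confirmed behaviour.

```python
from typing import Dict, List, Set, Tuple

RawFeatures = Dict[Tuple[int, str], str]


def build_feature_to_idx_dict(
    examples: List[List[RawFeatures]],
) -> Dict[Tuple[int, str], Dict[str, int]]:
    """Assigns one column index to every (feature key, value) pair seen."""
    # Collect all values observed per feature key across all examples.
    values_per_key: Dict[Tuple[int, str], Set[str]] = {}
    for token_features in examples:
        for features in token_features:
            for key, value in features.items():
                values_per_key.setdefault(key, set()).add(value)

    # Sort keys and values so that the indices are reproducible between runs.
    feature_to_idx: Dict[Tuple[int, str], Dict[str, int]] = {}
    next_index = 0
    for key in sorted(values_per_key):
        feature_to_idx[key] = {}
        for value in sorted(values_per_key[key]):
            feature_to_idx[key][value] = next_index
            next_index += 1
    return feature_to_idx


# One training example with two tokens, using hypothetical raw features.
training_examples = [
    [
        {(0, "upper"): "True", (1, "prefix2"): "wo"},
        {(-1, "low"): "False", (0, "upper"): "False"},
    ]
]
mapping = build_feature_to_idx_dict(training_examples)
```

Sorting before assigning indices is a design choice in this sketch: it keeps the column layout stable regardless of the order in which training examples are visited.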


def warn_if_pos_features_cannot_be_computed(
    training_data: TrainingData
) -> None

Warns if part-of-speech features are needed but not given.


def process(messages: List[Message]) -> List[Message]

Featurizes all given messages in-place.


  • messages - messages to be featurized.


Returns the same list with the same messages after featurization.
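As a rough illustration of what featurizing a token can look like once a value-to-index mapping exists, the sketch below turns the raw features of one token into the column indices of a one-hot encoding. The helper `one_hot_columns` is hypothetical, and silently skipping values that were not seen during training is an assumption of this example.

```python
from typing import Dict, List, Tuple

FeatureToIdx = Dict[Tuple[int, str], Dict[str, int]]


def one_hot_columns(
    token_features: Dict[Tuple[int, str], str], feature_to_idx: FeatureToIdx
) -> List[int]:
    """Returns the sorted column indices that are set to 1 for one token."""
    columns = []
    for key, value in token_features.items():
        index = feature_to_idx.get(key, {}).get(value)
        if index is not None:  # values unseen during training are ignored
            columns.append(index)
    return sorted(columns)


# Hypothetical mapping of the shape Dict[Tuple[int, Text], Dict[Text, int]].
feature_to_idx = {
    (-1, "low"): {"False": 0},
    (0, "upper"): {"False": 1, "True": 2},
    (1, "prefix2"): {"wo": 3},
}
active = one_hot_columns({(0, "upper"): "True", (1, "prefix2"): "zz"}, feature_to_idx)
```

Here `active` contains only the column for `(0, "upper") == "True"`, because the prefix `"zz"` has no entry in the mapping. The resulting indices could be placed into one row of a sparse matrix per token.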


def process_training_data(training_data: TrainingData) -> TrainingData

Processes the training examples in the given training data in-place.


  • training_data - the training data


Returns the same training data after processing.


@classmethod
def create(
    cls,
    config: Dict[Text, Any],
    model_storage: ModelStorage,
    resource: Resource,
    execution_context: ExecutionContext
) -> LexicalSyntacticFeaturizer

Creates a new untrained component (see parent class for full docstring).


@classmethod
def load(
    cls,
    config: Dict[Text, Any],
    model_storage: ModelStorage,
    resource: Resource,
    execution_context: ExecutionContext,
    **kwargs: Any
) -> LexicalSyntacticFeaturizer

Loads trained component (see parent class for full docstring).


def persist() -> None

Persists this model (see parent class for full docstring).