COUNT_VECTORIZER

The COUNT_VECTORIZER node receives a collection (matrix, vector, or dataframe) of text documents and converts it to a matrix of token counts.

Params:

default : DataFrame|Matrix|Vector
    The corpus to vectorize.

Returns:

tokens : DataFrame
    Holds all the unique tokens observed in the input.
word_count_vector : Vector
    Contains the occurrences of these tokens in each sentence.
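To make "a matrix of token counts" concrete, here is a minimal hand-rolled sketch that uses only the standard library. It is not the node's implementation, which delegates to scikit-learn. The helper name `count_tokens` and the sample corpus are made up for illustration, and the lowercasing and two-or-more-character token rule are assumptions chosen to mirror scikit-learn's default settings.

```python
import re
from collections import Counter

# Assumption: tokens are runs of two or more word characters, compared in lowercase.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_tokens(corpus):
    """Return (sorted vocabulary, one row of counts per document)."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in corpus]
    # The vocabulary is every unique token seen anywhere in the corpus.
    vocabulary = sorted({token for doc in tokenized for token in doc})
    rows = []
    for doc in tokenized:
        seen = Counter(doc)
        # One column per vocabulary entry; tokens absent from this document count as 0.
        rows.append([seen[token] for token in vocabulary])
    return vocabulary, rows


corpus = ["The cat sat on the mat", "The dog chased the cat"]
tokens, word_counts = count_tokens(corpus)
print(tokens)       # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(word_counts)  # [[1, 0, 0, 1, 1, 1, 2], [1, 1, 1, 0, 0, 0, 2]]
```

Each row corresponds to one input sentence and each column to one entry of the vocabulary, which is the same pairing the `tokens` and `word_count_vector` outputs describe.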

Python Code

from typing import TypedDict
from sklearn.feature_extraction.text import CountVectorizer
from flojoy import flojoy, DataFrame, Matrix, Vector
import pandas as pd


class CountVectorizerOutput(TypedDict):
    tokens: DataFrame
    word_count_vector: Vector


@flojoy(deps={"scikit-learn": "1.2.2"})
def COUNT_VECTORIZER(default: DataFrame | Matrix | Vector) -> CountVectorizerOutput:
    """The COUNT_VECTORIZER node receives a collection (matrix, vector or dataframe) of text documents and converts it to a matrix of token counts.

    Parameters
    ----------
    default : DataFrame|Matrix|Vector
        The corpus to vectorize.

    Returns
    -------
    tokens: DataFrame
        Holds all the unique tokens observed from the input.
    word_count_vector: Vector
        Contains the occurrences of these tokens in each sentence.
    """

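    # Unwrap the Flojoy container to get the underlying array of documents.
    if isinstance(default, DataFrame):
        data = default.m.values
    elif isinstance(default, Vector):
        data = default.v
    else:
        data = default.m

    # Flatten so that every cell is treated as one document, then learn the
    # vocabulary and count the tokens in a single pass.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(data.flatten())

    # Vocabulary as a one-column DataFrame; counts as a dense array
    # (one row per document, one column per token).
    x = pd.DataFrame({"tokens": vectorizer.get_feature_names_out()})
    y = X.toarray()  # type: ignore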

    return CountVectorizerOutput(tokens=DataFrame(df=x), word_count_vector=Vector(v=y))

Example

In this example, the READ_CSV node loads a local file. The COUNT_VECTORIZER node then transforms the received dataframe of text into token counts and returns two outputs: tokens, a DataFrame of the unique words, and word_count_vector, which holds the token counts for each sentence.