TextCatalog.TokenizeIntoCharactersAsKeys Method

Definition

Namespace:: Microsoft.ML

Assembly:: Microsoft.ML.Transforms.dll

Package:: Microsoft.ML v1.0.0, v1.1.0, v1.2.0, v1.3.1, v1.4.0, v1.5.5, v1.6.0, v1.7.0, v2.0.1, v3.0.1, v4.0.1, v5.0.0-preview.1.25125.4

Source:: TextCatalog.cs

Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

Creates a TokenizingByCharactersEstimator, which tokenizes text by splitting it into sequences of characters using a sliding window.

public static Microsoft.ML.Transforms.Text.TokenizingByCharactersEstimator TokenizeIntoCharactersAsKeys(this Microsoft.ML.TransformsCatalog.TextTransforms catalog, string outputColumnName, string inputColumnName = default, bool useMarkerCharacters = true);

static member TokenizeIntoCharactersAsKeys : Microsoft.ML.TransformsCatalog.TextTransforms * string * string * bool -> Microsoft.ML.Transforms.Text.TokenizingByCharactersEstimator

<Extension()>
Public Function TokenizeIntoCharactersAsKeys (catalog As TransformsCatalog.TextTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing, Optional useMarkerCharacters As Boolean = true) As TokenizingByCharactersEstimator

Parameters

catalog: TransformsCatalog.TextTransforms

The text-related transform's catalog.

outputColumnName: String

Name of the column resulting from the transformation of inputColumnName. This column's data type will be a variable-sized vector of keys.

inputColumnName: String

Name of the column to transform. If set to null, the value of outputColumnName is used as the source. This estimator operates on the text data type.

useMarkerCharacters: Boolean

Whether to prepend a marker character, 0x02, to the beginning of the output vector of characters and append another marker character, 0x03, to its end. The markers make the tokens easier to distinguish, for example when debugging.

Returns

TokenizingByCharactersEstimator

Examples

using System;
using System.Collections.Generic;
using Microsoft.ML;

namespace Samples.Dynamic
{
    public static class TokenizeIntoCharactersAsKeys
    {
        public static void Example()
        {
            // Create a new ML context, for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create an empty list as the dataset. 'TokenizeIntoCharactersAsKeys'
            // does not require training data, because the estimator it creates
            // ('TokenizingByCharactersEstimator') is not a trainable estimator.
            // The empty list is only needed to pass the input schema to the
            // pipeline.
            var emptySamples = new List<TextData>();

            // Convert the sample list to an empty IDataView.
            var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);

            // A pipeline for converting text into a vector of characters.
            // 'TokenizeIntoCharactersAsKeys' produces its result as a key type.
            // 'MapKeyToValue' is needed to map keys back to their original values.
            var textPipeline = mlContext.Transforms.Text
                .TokenizeIntoCharactersAsKeys("CharTokens", "Text",
                    useMarkerCharacters: false)
                .Append(mlContext.Transforms.Conversion.MapKeyToValue(
                    "CharTokens"));

            // Fit to data.
            var textTransformer = textPipeline.Fit(emptyDataView);

            // Create the prediction engine to get the character vector from the
            // input text/string.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData,
                TransformedTextData>(textTransformer);

            // Call the prediction API to convert the text into characters.
            var data = new TextData()
            {
                Text = "ML.NET's " +
                    "TokenizeIntoCharactersAsKeys API splits text/string into " +
                    "characters."
            };

            var prediction = predictionEngine.Predict(data);

            // Print the length of the character vector.
            Console.WriteLine($"Number of tokens: {prediction.CharTokens.Length}");

            // Print the character vector.
            Console.WriteLine("\nCharacter Tokens: " + string.Join(",", prediction
                .CharTokens));

            // Expected output:
            // Number of tokens: 77
            // Character Tokens: M,L,.,N,E,T,',s,<?>,T,o,k,e,n,i,z,e,I,n,t,o,C,h,a,r,a,c,t,e,r,s,A,s,K,e,y,s,<?>,A,P,I,<?>,
            // s,p,l,i,t,s,<?>,t,e,x,t,/,s,t,r,i,n,g,<?>,i,n,t,o,<?>,c,h,a,r,a,c,t,e,r,s,.
            //
            // <?>: a Unicode control character used instead of spaces ('\u2400').
        }

        private class TextData
        {
            public string Text { get; set; }
        }

        private class TransformedTextData : TextData
        {
            public string[] CharTokens { get; set; }
        }
    }
}
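
The sample above sets useMarkerCharacters to false. As a rough illustration of the other setting, the following sketch tokenizes the same short string with and without marker characters and prints both token counts. The class name MarkerCharactersSketch, the helper CountTokens, and the row types are hypothetical names introduced only for this illustration. Based on the parameter description, the count with markers should be larger by two (one 0x02 at the start and one 0x03 at the end), but treat that as an expectation to verify rather than a guaranteed result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

public static class MarkerCharactersSketch
{
    // Hypothetical input row type for this sketch.
    public class TextData
    {
        public string Text { get; set; }
    }

    // Hypothetical output row type: the character tokens mapped back to text.
    public class TransformedTextData : TextData
    {
        public string[] CharTokens { get; set; }
    }

    // Tokenizes a single string and returns the number of tokens produced.
    private static int CountTokens(MLContext mlContext, string text,
        bool useMarkerCharacters)
    {
        // A one-row data set holding the text to tokenize.
        var dataView = mlContext.Data.LoadFromEnumerable(
            new List<TextData> { new TextData { Text = text } });

        // Tokenize into keys, then map the keys back to their values.
        var pipeline = mlContext.Transforms.Text
            .TokenizeIntoCharactersAsKeys("CharTokens", "Text",
                useMarkerCharacters: useMarkerCharacters)
            .Append(mlContext.Transforms.Conversion.MapKeyToValue(
                "CharTokens"));

        var transformed = pipeline.Fit(dataView).Transform(dataView);

        // Read the single transformed row back as a .NET object.
        var row = mlContext.Data
            .CreateEnumerable<TransformedTextData>(transformed,
                reuseRowObject: false)
            .First();

        return row.CharTokens.Length;
    }

    public static void Main()
    {
        var mlContext = new MLContext();
        const string text = "ML.NET";

        int withoutMarkers = CountTokens(mlContext, text, false);
        int withMarkers = CountTokens(mlContext, text, true);

        Console.WriteLine($"Tokens without markers: {withoutMarkers}");
        Console.WriteLine($"Tokens with markers: {withMarkers}");
    }
}
```

Comparing the two counts is a quick way to check whether markers are present before the tokens are passed to a downstream transform.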

URL: https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.textcatalog.tokenizeintocharactersaskeys?view=ml-dotnet-preview