TextCatalog.TokenizeIntoCharactersAsKeys Method
Definition
- Namespace:
- Microsoft.ML
- Assembly:
- Microsoft.ML.Transforms.dll
- Package:
- Microsoft.ML v1.0.0, v1.1.0, v1.2.0, v1.3.1, v1.4.0, v1.5.5, v1.6.0, v1.7.0, v2.0.1, v3.0.1, v4.0.1, v5.0.0-preview.1.25125.4
- Source:
- TextCatalog.cs
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Create a TokenizingByCharactersEstimator, which tokenizes by splitting text into sequences of characters using a sliding window.
public static Microsoft.ML.Transforms.Text.TokenizingByCharactersEstimator TokenizeIntoCharactersAsKeys(this Microsoft.ML.TransformsCatalog.TextTransforms catalog, string outputColumnName, string inputColumnName = default, bool useMarkerCharacters = true);
static member TokenizeIntoCharactersAsKeys : Microsoft.ML.TransformsCatalog.TextTransforms * string * string * bool -> Microsoft.ML.Transforms.Text.TokenizingByCharactersEstimator
<Extension()>
Public Function TokenizeIntoCharactersAsKeys (catalog As TransformsCatalog.TextTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing, Optional useMarkerCharacters As Boolean = true) As TokenizingByCharactersEstimator
Parameters
- catalog
- TransformsCatalog.TextTransforms
The text-related transform's catalog.
- outputColumnName
- String
Name of the column resulting from the transformation of inputColumnName.
This column's data type will be a variable-sized vector of keys.
- inputColumnName
- String
Name of the column to transform. If set to null, the value of the
outputColumnName will be used as source.
This estimator operates over text data type.
- useMarkerCharacters
- Boolean
To be able to distinguish the tokens, for example for debugging purposes,
you can choose to prepend a marker character, 0x02, to the beginning,
and append another marker character, 0x03, to the end of the output vector of characters.
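The effect of useMarkerCharacters can be pictured with a small standalone sketch. This is not the ML.NET implementation: `MarkerCharacterSketch` and `SplitIntoCharacters` are hypothetical names introduced only for illustration, and the sketch assumes nothing beyond the parameter description above (0x02 prepended, 0x03 appended). It also yields plain characters rather than keys.

```csharp
using System;
using System.Collections.Generic;

public static class MarkerCharacterSketch
{
    // Marker values taken from the parameter description above.
    private const char StartMarker = '\u0002';
    private const char EndMarker = '\u0003';

    // Hypothetical helper, not part of ML.NET: splits text into single
    // characters and optionally wraps them in the start/end markers.
    public static List<char> SplitIntoCharacters(string text,
        bool useMarkerCharacters)
    {
        var tokens = new List<char>();
        if (useMarkerCharacters)
            tokens.Add(StartMarker);

        tokens.AddRange(text);

        if (useMarkerCharacters)
            tokens.Add(EndMarker);

        return tokens;
    }

    public static void Main()
    {
        var marked = SplitIntoCharacters("ML", useMarkerCharacters: true);
        var plain = SplitIntoCharacters("ML", useMarkerCharacters: false);

        // With markers the vector has two extra entries.
        Console.WriteLine(marked.Count); // 4
        Console.WriteLine(plain.Count);  // 2
    }
}
```

In this sketch the markers only bracket the character sequence; the characters themselves are unchanged, which is why disabling the option simply shortens the vector by two entries.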
Returns
- TokenizingByCharactersEstimator
Examples
using System;
using System.Collections.Generic;
using Microsoft.ML;

namespace Samples.Dynamic
{
    public static class TokenizeIntoCharactersAsKeys
    {
        public static void Example()
        {
            // Create a new ML context, for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create an empty list as the dataset. The
            // 'TokenizeIntoCharactersAsKeys' does not require training data as
            // the estimator ('TokenizingByCharactersEstimator') created by
            // 'TokenizeIntoCharactersAsKeys' API is not a trainable estimator.
            // The empty list is only needed to pass input schema to the pipeline.
            var emptySamples = new List<TextData>();

            // Convert sample list to an empty IDataView.
            var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);

            // A pipeline for converting text into a vector of characters.
            // 'TokenizeIntoCharactersAsKeys' produces its result as a key type.
            // 'MapKeyToValue' is needed to map keys back to their original values.
            var textPipeline = mlContext.Transforms.Text
                .TokenizeIntoCharactersAsKeys("CharTokens", "Text",
                    useMarkerCharacters: false)
                .Append(mlContext.Transforms.Conversion.MapKeyToValue(
                    "CharTokens"));

            // Fit to data.
            var textTransformer = textPipeline.Fit(emptyDataView);

            // Create the prediction engine to get the character vector from the
            // input text/string.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData,
                TransformedTextData>(textTransformer);

            // Call the prediction API to convert the text into characters.
            var data = new TextData()
            {
                Text = "ML.NET's " +
                    "TokenizeIntoCharactersAsKeys API splits text/string into " +
                    "characters."
            };

            var prediction = predictionEngine.Predict(data);

            // Print the length of the character vector.
            Console.WriteLine($"Number of tokens: {prediction.CharTokens.Length}");

            // Print the character vector.
            Console.WriteLine("\nCharacter Tokens: " + string.Join(",", prediction
                .CharTokens));

            // Expected output:
            // Number of tokens: 77
            // Character Tokens: M,L,.,N,E,T,',s,<?>,T,o,k,e,n,i,z,e,I,n,t,o,C,h,a,r,a,c,t,e,r,s,A,s,K,e,y,s,<?>,A,P,I,<?>,
            // s,p,l,i,t,s,<?>,t,e,x,t,/,s,t,r,i,n,g,<?>,i,n,t,o,<?>,c,h,a,r,a,c,t,e,r,s,.
            //
            // <?> stands for a Unicode control character used instead of spaces ('\u2400').
        }

        private class TextData
        {
            public string Text { get; set; }
        }

        private class TransformedTextData : TextData
        {
            public string[] CharTokens { get; set; }
        }
    }
}
