TextCatalog.LatentDirichletAllocation Method
Definition
- Namespace:
- Microsoft.ML
- Assembly:
- Microsoft.ML.Transforms.dll
- Package:
- Microsoft.ML v1.0.0, v1.1.0, v1.2.0, v1.3.1, v1.4.0, v1.5.5, v1.6.0, v1.7.0, v2.0.1, v3.0.1, v4.0.1, v5.0.0-preview.1.25125.4
- Source:
- TextCatalog.cs
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Create a LatentDirichletAllocationEstimator, which uses LightLDA to transform text (represented as a vector of floats) into a vector of Single indicating the similarity of the text with each topic identified.
public static Microsoft.ML.Transforms.Text.LatentDirichletAllocationEstimator LatentDirichletAllocation(this Microsoft.ML.TransformsCatalog.TextTransforms catalog, string outputColumnName, string inputColumnName = default, int numberOfTopics = 100, float alphaSum = 100, float beta = 0.01, int samplingStepCount = 4, int maximumNumberOfIterations = 200, int likelihoodInterval = 5, int numberOfThreads = 0, int maximumTokenCountPerDocument = 512, int numberOfSummaryTermsPerTopic = 10, int numberOfBurninIterations = 10, bool resetRandomGenerator = false);
static member LatentDirichletAllocation : Microsoft.ML.TransformsCatalog.TextTransforms * string * string * int * single * single * int * int * int * int * int * int * int * bool -> Microsoft.ML.Transforms.Text.LatentDirichletAllocationEstimator
<Extension()>
Public Function LatentDirichletAllocation (catalog As TransformsCatalog.TextTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing, Optional numberOfTopics As Integer = 100, Optional alphaSum As Single = 100, Optional beta As Single = 0.01, Optional samplingStepCount As Integer = 4, Optional maximumNumberOfIterations As Integer = 200, Optional likelihoodInterval As Integer = 5, Optional numberOfThreads As Integer = 0, Optional maximumTokenCountPerDocument As Integer = 512, Optional numberOfSummaryTermsPerTopic As Integer = 10, Optional numberOfBurninIterations As Integer = 10, Optional resetRandomGenerator As Boolean = false) As LatentDirichletAllocationEstimator
Parameters
- catalog
- TransformsCatalog.TextTransforms
The transform's catalog.
- outputColumnName
- String
Name of the column resulting from the transformation of inputColumnName.
This estimator outputs a vector of Single.
- inputColumnName
- String
Name of the column to transform. If set to null, the value of the outputColumnName will be used as source.
This estimator operates over a vector of Single.
- numberOfTopics
- Int32
The number of topics.
- alphaSum
- Single
Dirichlet prior on document-topic vectors.
- beta
- Single
Dirichlet prior on vocab-topic vectors.
- samplingStepCount
- Int32
Number of Metropolis-Hastings steps.
- maximumNumberOfIterations
- Int32
Number of iterations.
- likelihoodInterval
- Int32
Compute log likelihood over local dataset on this iteration interval.
- numberOfThreads
- Int32
The number of training threads. Default value depends on number of logical processors.
- maximumTokenCountPerDocument
- Int32
The maximum number of tokens per document.
- numberOfSummaryTermsPerTopic
- Int32
The number of words to summarize the topic.
- numberOfBurninIterations
- Int32
The number of burn-in iterations.
- resetRandomGenerator
- Boolean
Reset the random number generator for each document.
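The optional parameters above can be overridden by name, leaving the rest at their defaults. The following is a minimal sketch rather than an official sample: the class name `LdaOptionsSketch`, the column names `Topics` and `Tokens`, and the chosen values are assumptions picked purely for illustration.

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public static class LdaOptionsSketch
{
    // Hypothetical helper: builds the estimator with a few defaults overridden.
    // Column names and values are illustrative assumptions, not recommendations.
    public static LatentDirichletAllocationEstimator Build(MLContext mlContext)
    {
        return mlContext.Transforms.Text.LatentDirichletAllocation(
            outputColumnName: "Topics",
            inputColumnName: "Tokens",
            numberOfTopics: 20,               // default is 100
            maximumNumberOfIterations: 100,   // default is 200
            numberOfThreads: 1,               // default is 0
            numberOfSummaryTermsPerTopic: 5); // default is 10
    }
}
```

Using named arguments keeps the call readable given the long parameter list, and any parameter not mentioned keeps the default shown in the signature.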
Returns
- LatentDirichletAllocationEstimator
Examples
using System;
using System.Collections.Generic;
using Microsoft.ML;

namespace Samples.Dynamic
{
    public static class LatentDirichletAllocation
    {
        public static void Example()
        {
            // Create a new ML context, for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create a small dataset as an IEnumerable.
            var samples = new List<TextData>()
            {
                new TextData(){ Text = "ML.NET's LatentDirichletAllocation API " +
                    "computes topic models." },
                new TextData(){ Text = "ML.NET's LatentDirichletAllocation API " +
                    "is the best for topic models." },
                new TextData(){ Text = "I like to eat broccoli and bananas." },
                new TextData(){ Text = "I eat bananas for breakfast." },
                new TextData(){ Text = "This car is expensive compared to last " +
                    "week's price." },
                new TextData(){ Text = "This car was $X last week." },
            };

            // Convert training data to IDataView.
            var dataview = mlContext.Data.LoadFromEnumerable(samples);

            // A pipeline for featurizing the text/string using the
            // LatentDirichletAllocation API. To be more accurate in computing the
            // LDA features, the pipeline first normalizes text and removes stop
            // words before passing tokens (the individual words, lower cased, with
            // common words removed) to LatentDirichletAllocation.
            var pipeline = mlContext.Transforms.Text.NormalizeText("NormalizedText",
                "Text")
                .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens",
                    "NormalizedText"))
                .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Tokens"))
                .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
                .Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
                .Append(mlContext.Transforms.Text.LatentDirichletAllocation(
                    "Features", "Tokens", numberOfTopics: 3));

            // Fit to data.
            var transformer = pipeline.Fit(dataview);

            // Create the prediction engine to get the LDA features extracted from
            // the text.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData,
                TransformedTextData>(transformer);

            // Convert the sample text into LDA features and print it.
            PrintLdaFeatures(predictionEngine.Predict(samples[0]));
            PrintLdaFeatures(predictionEngine.Predict(samples[1]));

            // Features obtained post-transformation.
            // For LatentDirichletAllocation, we specified numberOfTopics: 3. Hence
            // each prediction has been featurized as a vector of floats with
            // length 3.
            //  Topic1  Topic2  Topic3
            //  0.6364  0.2727  0.0909
            //  0.5455  0.1818  0.2727
        }

        private static void PrintLdaFeatures(TransformedTextData prediction)
        {
            for (int i = 0; i < prediction.Features.Length; i++)
                Console.Write($"{prediction.Features[i]:F4} ");
            Console.WriteLine();
        }

        private class TextData
        {
            public string Text { get; set; }
        }

        private class TransformedTextData : TextData
        {
            public float[] Features { get; set; }
        }
    }
}
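Because each output vector holds one score per topic, a common follow-up step is to pick the highest-scoring topic for a document. The helper below is a small sketch of that idea in plain C#. `TopicHelpers` and `DominantTopic` are hypothetical names introduced here and are not part of ML.NET.

```csharp
using System;

public static class TopicHelpers
{
    // Hypothetical helper: returns the zero-based index of the largest score
    // in an LDA feature vector. Ties resolve to the lowest index.
    public static int DominantTopic(float[] features)
    {
        if (features == null || features.Length == 0)
            throw new ArgumentException(
                "Feature vector must not be empty.", nameof(features));

        int best = 0;
        for (int i = 1; i < features.Length; i++)
        {
            if (features[i] > features[best])
                best = i;
        }
        return best;
    }
}
```

For the two rows printed in the example's comments, `TopicHelpers.DominantTopic(new[] { 0.6364f, 0.2727f, 0.0909f })` and `TopicHelpers.DominantTopic(new[] { 0.5455f, 0.1818f, 0.2727f })` both return 0, which corresponds to the Topic1 column.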
