from pyspark.ml.regression import RandomForestRegressor

# Every record contains a label and feature vector
df = spark.createDataFrame(data, ["label", "features"])
# Split the data into train/test datasets
train_df, test_df = df.randomSplit([.80, .20], seed=42)
# Set hyperparameters for the algorithm
rf = RandomForestRegressor(numTrees=100)
# Fit the model to the training data
model = rf.fit(train_df)
# Generate predictions on the test dataset.
model.transform(test_df).show()
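The snippet above assumes an existing `spark` session and a `data` collection, neither of which is defined on the page. A minimal sketch of one way to supply them follows. The helper names `make_rows` and `run_example`, the synthetic linear relationship, and the `local[*]` master are all illustrative assumptions, not part of the original example.

```python
import random


def make_rows(n, seed=0):
    """Build n synthetic (label, features) rows in plain Python.

    Hypothetical helper: the label is a noisy linear function of two
    random features, chosen only so the regressor has a signal to learn.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        label = 3.0 * x1 - 2.0 * x2 + rng.gauss(0.0, 0.1)
        rows.append((label, [x1, x2]))
    return rows


def run_example(n=200):
    """Train and apply a random forest on the synthetic rows (needs PySpark)."""
    # Imported here so make_rows stays usable without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import RandomForestRegressor

    # Assumption: a local session using all available cores.
    spark = SparkSession.builder.master("local[*]").appName("rf-sketch").getOrCreate()
    try:
        # Spark ML expects the features column to hold vectors, not lists.
        data = [(label, Vectors.dense(features)) for label, features in make_rows(n)]
        df = spark.createDataFrame(data, ["label", "features"])
        train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
        model = RandomForestRegressor(numTrees=100).fit(train_df)
        model.transform(test_df).show(5)
    finally:
        spark.stop()
```

Calling `run_example()` requires a local PySpark installation and a Java runtime. `make_rows` can be inspected on its own to see the shape of the input the page's snippet expects.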
df = spark.read.csv("accounts.csv", header=True)
# Select subset of features and filter for balance > 0
filtered_df = df.select("AccountBalance", "CountOfDependents").filter("AccountBalance > 0")
# Generate summary statistics
filtered_df.summary().show()
Run now
$ docker run -it --rm spark /opt/spark/bin/spark-sql
The most widely used engine for scalable computing
Thousands of companies, including 80% of the Fortune 500, use Apache Spark™. The open source project has over 2,000 contributors from industry and academia.
Ecosystem
Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.