Help for package sboost

Type:

Package

Title:

Machine Learning with AdaBoost on Decision Stumps

Version:

0.1.2

Description:

Creates classifier for binary outcomes using Adaptive Boosting (AdaBoost) algorithm on decision stumps with a fast C++ implementation. For a description of AdaBoost, see Freund and Schapire (1997) <doi:10.1006/jcss.1997.1504>. This type of classifier is nonlinear, but easy to interpret and visualize. Feature vectors may be a combination of continuous (numeric) and categorical (string, factor) elements. Methods for classifier assessment, predictions, and cross-validation also included.

License:

MIT + file LICENSE

URL:

https://github.com/jadonwagstaff/sboost

BugReports:

https://github.com/jadonwagstaff/sboost/issues

Encoding:

UTF-8

LazyData:

true

Depends:

R (≥ 3.4.0)

LinkingTo:

Rcpp (≥ 0.12.17)

Imports:

dplyr (≥ 0.7.6), rlang (≥ 0.2.1), Rcpp (≥ 0.12.17), stats (≥ 3.4)

RoxygenNote:

7.1.2

Suggests:

testthat

NeedsCompilation:

yes

Packaged:

2022-05-25 21:11:26 UTC; Home

Author:

Jadon Wagstaff [aut, cre]

Maintainer:

Jadon Wagstaff <jadonw@gmail.com>

Repository:

CRAN

Date/Publication:

2022-05-26 13:10:02 UTC

sboost Assessment Function

Description

Assesses how well an sboost classifier classifies the data.

Usage

assess(object, features, outcomes, include_scores = FALSE)

Arguments

object

sboost_classifier S3 object output from sboost.

features

feature set data.frame.

outcomes

outcomes corresponding to the features.

include_scores

if true feature_scores are included in output.

Value

An sboost_assessment S3 object containing:

performance: Last row of cumulative statistics (i.e. when all stumps are included in assessment).
cumulative_statistics: stump - the index of the last decision stump added to the assessment.
true_positive - number of true positive predictions.
false_negative - number of false negative predictions.
true_negative - number of true negative predictions.
false_positive - number of false positive predictions.
prevalence - true positive / total.
accuracy - correct predictions / total.
sensitivity - correct predicted positive / true positive.
specificity - correct predicted negative / true negative.
ppv - correct predicted positive / predicted positive.
npv - correct predicted negative / predicted negative.
f1 - harmonic mean of sensitivity and ppv.
feature_scores: If include_scores is TRUE, for each feature in the classifier lists scores for each row in the feature set.
classifier: sboost sboost_classifier object used for assessment.
outcomes: Shows which outcome was considered as positive and which negative.
call: Shows the parameters that were used for assessment.

Examples

# malware
malware_classifier <- sboost(malware[-1], malware[1], iterations = 5, positive = 1)
assess(malware_classifier, malware[-1], malware[1])

# mushrooms
mushroom_classifier <- sboost(mushrooms[-1], mushrooms[1], iterations = 5, positive = "p")
assess(mushroom_classifier, mushrooms[-1], mushrooms[1])

Malware System Calls

Description

System call data for apps identified as malware and not malware.

Usage

malware

Format

A data frame with 7597 rows and 361 variables: outcomes 1 if malware, 0 if not. X1... X360 system calls.

Details

Experimental data generated in this research paper:

M. Dimjašević, S. Atzeni, I. Ugrina, and Z. Rakamarić, "Evaluation of Android Malware Detection Based on System Calls," in Proceedings of the International Workshop on Security and Privacy Analytics (IWSPA), 2016.

Data used for kaggle competition: https://www.kaggle.com/c/ml-fall2016-android-malware

Source

https://zenodo.org/record/154737#.WtoA1IjwaUl

Mushroom Classification

Description

A classic machine learning data set describing hypothetical samples from the Agaricus and Lepiota family.

Usage

mushrooms

Format

A data frame with 7597 rows and 361 variables:

outcomes: p=poisonous, e=edible
cap_shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
cap_surface: fibrous=f, grooves=g, scaly=y, smooth=s
cap_color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
bruises: bruises=t, no=f
odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
gill_attachment: attached=a, descending=d, free=f, notched=n
gill_spacing: close=c, crowded=w, distant=d
gill_size: broad=b, narrow=n
gill_color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
stalk_shape: enlarging=e, tapering=t
stalk_root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
stalk_surface_above_ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk_surface_below_ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk_color_above_ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
stalk_color_below_ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
veil_type: partial=p, universal=u
veil_color: brown=n, orange=o, white=w, yellow=y
ring_number: none=n, one=o, two=t
ring_type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
spore_print_color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

Details

Data gathered from:

Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf

Source

https://archive.ics.uci.edu/ml/datasets/mushroom

Make predictions for a feature set based on an sboost classifier.

Description

Make predictions for a feature set based on an sboost classifier.

Usage

## S3 method for class 'sboost_classifier'
predict(object, features, scores = FALSE, ...)

Arguments

object

sboost_classifier S3 object output from sboost.

features

feature set data.frame.

scores

if true, raw scores generated; if false, predictions are generated.

...

further arguments passed to or from other methods.

Value

Predictions in the form of a vector, or scores in the form of a vector. The index of the vector aligns the predictions or scores with the rows of the features. Scores represent the sum of all votes for the positive outcome minus the sum of all votes for the negative outcome.

Examples

# malware
malware_classifier <- sboost(malware[-1], malware[1], iterations = 5, positive = 1)
predict(malware_classifier, malware[-1], scores = TRUE)
predict(malware_classifier, malware[-1])

# mushrooms
mushroom_classifier <- sboost(mushrooms[-1], mushrooms[1], iterations = 5, positive = "p")
predict(mushroom_classifier, mushrooms[-1], scores = TRUE)
predict(mushroom_classifier, mushrooms[-1])

sboost Learning Algorithm

Description

A machine learning algorithm using AdaBoost on decision stumps.

Usage

sboost(features, outcomes, iterations = 1, positive = NULL, verbose = FALSE)

Arguments

features

feature set data.frame.

outcomes

outcomes corresponding to the features.

iterations

number of boosts.

positive

the positive outcome to test for; if NULL, the first outcome in alphabetical (or numerical) order will be chosen.

verbose

If true, progress bar will be displayed in console.

Details

Factors and characters are treated as categorical features. Missing values are supported.

See https://jadonwagstaff.github.io/projects/sboost.html for a description of the algorithm.

For original paper describing AdaBoost see:

Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119-139 (1997)

Value

An sboost_classifier S3 object containing:

classifier: stump - the index of the decision stump
feature - name of the column that this stump splits on.
vote - the weight that this stump has on the final classifier.
orientation - shows how outcomes are split. If feature is numeric shows split orientation, if feature value is less than split then vote is cast in favor of left side outcome, otherwise the vote is cast for the right side outcome. If feature is categorical, vote is cast for the left side outcome if feature value is found in left_categories, otherwise vote is cast for right side outcome.
split - if feature is numeric, the value where the decision stump splits the outcomes; otherwise, NA.
left_categories - if feature is categorical, shows the feature values that sway the vote to the left side outcome on the orientation split; otherwise, NA.
outcomes: Shows which outcome was considered as positive and which negative.
training: stumps - how many decision stumps were trained.
features - how many features the training set contained.
instances - how many instances or rows the training set contained.
positive_prevalence - what fraction of the training instances were positive.
call: Shows the parameters that were used to build the classifier.

Examples

# malware
malware_classifier <- sboost(malware[-1], malware[1], iterations = 5, positive = 1)
malware_classifier
malware_classifier$classifier

# mushrooms
mushroom_classifier <- sboost(mushrooms[-1], mushrooms[1], iterations = 5, positive = "p")
mushroom_classifier
mushroom_classifier$classifier

sboost Validation Function

Description

A k-fold cross validation algorithm for sboost.

Usage

validate(
  features,
  outcomes,
  iterations = 1,
  k_fold = 6,
  positive = NULL,
  verbose = FALSE
)

Arguments

features

feature set data.frame.

outcomes

outcomes corresponding to the features.

iterations

number of boosts.

k_fold

number of cross-validation subsets.

positive

is the positive outcome to test for; if NULL, the first in alphabetical order will be chosen

verbose

If true, progress bars will be displayed in console.

Value

An sboost_validation S3 object containing:

performance: Final performance statistics for all stumps.
training_summary_statistics: Mean and standard deviations for test statistics generated by assess cumulative statistics for each of the training sets.
testing_summary_statistics: Mean and standard deviations for test statistics generated by assess cumulative statistics for each of the testing sets.
training_statistics: sboost sboost_assessment cumulative statistics objects used to generate training_statistics.
testing_statistics: sboost sboost_assessment cumulative statistics objects used to generate testing_statistics.
classifier_list: sboost sboost_classifier objects created from training sets.
outcomes: Shows which outcome was considered as positive and which negative.
k_fold: number of testing and training sets used in the validation.
call: Shows the parameters that were used for validation.

Examples

# malware
validate(malware[-1], malware[1], iterations = 5, k_fold = 3, positive = 1)

# mushrooms
validate(mushrooms[-1], mushrooms[1], iterations = 5, k_fold = 3, positive = "p")

sboost Assessment Function

Description

Usage

Arguments

Value

See Also

Examples

Malware System Calls

Description

Usage

Format

Details

Source

Mushroom Classification

Description

Usage

Format

Details

Source

Make predictions for a feature set based on an sboost classifier.

Description

Usage

Arguments

Value

See Also

Examples

sboost Learning Algorithm

Description

Usage

Arguments

Details

Value

See Also

Examples

sboost Validation Function

Description

Usage

Arguments

Value

See Also

Examples