Supplement 1. R code and the data set necessary to conduct the Random Forest analysis.

File List

dreissena_in_lakes_of_belarus.csv (MD5: 3dc2d2f89af3064223358983c785771d)

r_script_random_forest.R (MD5: af1295890d60bc832955e940889e4575)


This Supplementary material contains two files necessary to fully reproduce the results obtained using the Random Forest classifier. The first of these files, dreissena_in_lakes_of_belarus.csv, is a plain text table that has 553 records, each described with the following variables:

1. Lake_Code: numeric codes uniquely identifying each lake (for reference only, not used explicitly in the analysis).

2. ZMpresence: indicator of whether a lake is infested with zebra mussel (0 – for non-infested, 1 – for infested).

3. LAREA: lake area

4. LVOL: lake volume

5. MAXD: maximal depth

6. AVED: average depth

7. SPECWATSHED: specific watershed (i.e., drainage area)

8. TRANSP: Secchi depth

9. COLOR: water color

10. pH: water pH

11. HCO3: HCO3 content

12. SO4: SO4 content

13. Cl: Cl content

14. Ca: Ca content

15. Mg: Mg content

16. TDS: total dissolved solids

17. Fe: Fe content

18. Si: Si content

19. NH4: NH4 content

20. NO2: NO2 content

21. PO4: PO4 content

22. PermOx: permanganate oxidizability

23. N: latitude (decimal degree)

24. E: longitude (decimal degree)

Missing values in the data set are denoted as NA.
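
As an illustration of how a table in this layout can be read and checked for missing values, a minimal sketch follows. It is not taken from r_script_random_forest.R: the helper name read_lakes and the small toy table (four columns, made-up values) are assumptions introduced only so that the example runs on its own.

```r
# Hypothetical helper, not part of the supplement's script: reads a table in
# the layout described above and converts the response to a two-level factor.
read_lakes <- function(path) {
  # "NA" marks missing values, as stated in the data description
  lakes <- read.csv(path, na.strings = "NA", stringsAsFactors = FALSE)
  # 0 = non-infested, 1 = infested
  lakes$ZMpresence <- factor(lakes$ZMpresence, levels = c(0, 1))
  lakes
}

# Toy stand-in with made-up values, used only to make the sketch self-contained
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Lake_Code,ZMpresence,LAREA,pH",
             "1,0,1.5,7.8",
             "2,1,NA,8.1",
             "3,0,0.4,NA"),
           toy_path)

toy <- read_lakes(toy_path)

# Number of records and variables
dim(toy)

# Number of missing values in each variable
na_counts <- colSums(is.na(toy))
na_counts
```

With the actual file stored in the working directory, the same helper could be called as read_lakes("dreissena_in_lakes_of_belarus.csv"); according to the description above, the result should then contain 553 records and 24 variables.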

The second file, r_script_random_forest.R, loads the data into R (assuming that the file dreissena_in_lakes_of_belarus.csv is stored in the current R working directory), fits the Random Forest model, and plots the results. The analysis relies on four add-on packages: caret, geosphere, randomForest, and ggplot2. All these packages are assumed to be already installed on the user's computer; if not, they can be freely downloaded from the Comprehensive R Archive Network or installed directly from within R using the following command: install.packages(c("caret", "geosphere", "randomForest", "ggplot2")).
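
Before running the script, it may be convenient to check which of the required packages are absent and to install only those. The sketch below is one possible way to do this and is not part of r_script_random_forest.R; the helper name missing_packages is an assumption introduced for illustration, and the installation step is restricted to interactive sessions so that nothing is downloaded unexpectedly.

```r
# Packages named in the description of r_script_random_forest.R
required <- c("caret", "geosphere", "randomForest", "ggplot2")

# Hypothetical helper: returns the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_pkgs <- missing_packages(required)

# Install only what is absent, and only when the user is at the console
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```

Once all four packages are available, the script can be run from the directory containing the data file, for example with source("r_script_random_forest.R").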