Here's a puzzle for you: from within the language, how do you list the sizes of the built-in datasets in #Rstats or #Python?
I'm new enough to R that I wasn't able to figure that one out.
Raw Python doesn't have built-in datasets, but sklearn does. (There are more in Seaborn and it's possible to load the R sets from Python.) Here's my first shot; the multiple special cases smell bad to me:
from sklearn import datasets
loaders = [f for f in dir(datasets)
           if f.startswith('load')
           and f != 'load_boston'
           and f != 'load_files'
           and not f.startswith('load_sample_image')
           and not f.startswith('load_svmlight_file')]
[(f[5:], len(eval('datasets.' + f + '().data'))) for f in loaders]
Output:
[('breast_cancer', 569), ('diabetes', 442), ('digits', 1797), ('iris', 150), ('linnerud', 20), ('wine', 178)]
My second pass is easier to read, but it squelches errors (gasp!) and still shows you the warning about the deprecated boston dataset:
from sklearn import datasets
names = []
sizes = []
for f in dir(datasets):
    try:
        sizes.append(eval('len(datasets.' + f + '().data)'))
        names.append(f[5:])
    except:
        pass
print(list(zip(names, sizes)))
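For what it's worth, here is a third-pass sketch (not from the thread; dataset_sizes and the fake stand-in module are mine) that drops eval in favor of getattr, squelches only the one failure we expect (a loader that needs arguments), and suppresses the deprecation warning instead of blacklisting names by hand. It's demoed on a stand-in namespace so it runs without sklearn, but passing sklearn's datasets module should work the same way:

```python
import warnings
from types import SimpleNamespace

def dataset_sizes(module, prefix='load_'):
    """List (name, n_rows) for every zero-argument loader in `module`
    whose name starts with `prefix`."""
    results = []
    for name in dir(module):
        if not name.startswith(prefix):
            continue
        loader = getattr(module, name)          # no eval needed
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')     # hide deprecation noise
            try:
                data = loader().data
            except TypeError:                   # loader needs arguments
                continue
        results.append((name[len(prefix):], len(data)))
    return results

# Stand-in for sklearn.datasets so the sketch is self-contained:
fake = SimpleNamespace(
    load_iris=lambda: SimpleNamespace(data=[0] * 150),
    load_wine=lambda: SimpleNamespace(data=[0] * 178),
    load_files=lambda path: None,  # takes an argument, like sklearn's
)
print(dataset_sizes(fake))  # [('iris', 150), ('wine', 178)]
```

Catching only TypeError keeps the "gasp!" factor down: any genuinely unexpected failure still surfaces.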
This is something of a weird reflective programming task in both languages, so neither can be blamed for inelegance.
@hhmacedo Perfect!
@hhmacedo I'm curious as to how the iris dataset comes to have a "length" of 600. Yes, there are 150 rows, each with 4 features, but what about the "Species" column?
@peterdrake iris and iris3 are organised in different ways: the former is a data frame with a Species column (5*150=750), while the latter is a 3-d array of 4*50 observations, with Species as the third dimension (4*50*3=600).
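The arithmetic in that reply checks out (a trivial sketch; the shapes are those of R's iris and iris3):

```python
# iris: a data frame with 150 rows and 5 columns (4 features + Species)
iris_len = 5 * 150       # 750
# iris3: a 50 x 4 x 3 array, with the 3 species as the third dimension
iris3_len = 4 * 50 * 3   # 600
print(iris_len, iris3_len)  # 750 600
```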
@hhmacedo "Ah-h-h," as Frank Herbert would say, "-h-h-h-h-h-h-h."
Thanks!
@peterdrake I'm sorry, but I don't know why I changed from "datasets" to "stats". Switching from memory size to length, the function would be:
ls(as.environment("package:datasets")) |>
  sapply(function(x) { length(unlist(get(x, as.environment("package:datasets")))) })