Friday, October 12, 2018

A goldmine

https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/

https://raw.githubusercontent.com/asadoughi/stat-learning/master/ch2/answers

http://www-bcf.usc.edu/~gareth/ISL/



Monday, October 8, 2018

Saturday, October 6, 2018

Transitioning towards a data science career

These two articles made for good reading and reference.

Rafael Knuth blogged about his transition into a data science career. He inspired someone else to follow in his footsteps & write about his own experience.



Comparing read_csv with spark_read_csv

Reading a csv file into R using readr's `read_csv()` function is simple. The syntax & parameters of readr are fairly easy to remember, once you've used them a few times.

read_csv(file, 
    col_names = TRUE, 
    col_types = NULL,
    locale = default_locale(),
    na = c("", "NA"), 
    quoted_na = TRUE,
    quote = "\"", 
    comment = "", 
    trim_ws = TRUE, 
    skip = 0, n_max = Inf,
    guess_max = min(1000, n_max), 
    progress = show_progress()
)
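To make the signature above concrete, here's a small self-contained sketch: it writes a tiny csv to a temp file and reads it back with `read_csv()`, supplying `col_types` explicitly instead of letting readr guess. The column names and values are made up for illustration.

```r
library(readr)

# Write a two-row csv to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,score",
             "alice,90",
             "bob,85"), tmp)

# Read it back; col_types pins the column types instead of guessing them.
df <- read_csv(tmp,
               col_names = TRUE,
               col_types = cols(name  = col_character(),
                                score = col_integer()))

nrow(df)   # 2 rows
names(df)  # "name" "score"
```

If you omit `col_types`, readr guesses types from the first `guess_max` rows, which is usually fine but can misfire on large files where later rows differ.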



I've only just started working with big data sets, & began wondering whether what I know about the readr syntax carries over to sparklyr's spark_read_csv() function.

While the two aren't exactly the same, if you know one you can quite easily pick up the other. There's an additional required parameter, `sc`, aka the Spark connection.

spark_read_csv(
    sc, 
    name,
    path, 
    header = TRUE, # FALSE forces a "V_" prefix
    columns = NULL,
    infer_schema = TRUE, # to infer column data type
    delimiter = ",", 
    quote = "\"", 
    escape = "\\",
    charset = "UTF-8", 
    null_value = NULL,
    options = list(),
    repartition = 0, # number of partitions used to distribute the generated table
    memory = TRUE, 
    overwrite = TRUE, ...
)

Tuesday, October 2, 2018