Skip to content

Latest commit

 

History

History
405 lines (363 loc) · 22.7 KB

gallery.md

File metadata and controls

405 lines (363 loc) · 22.7 KB

Gallery

This gallery contains some random questions about data manipulation that we found on internet. The original questions are posted in different forums and are related to different packages. Whenever, we can remember the original source of a question we provide a link to it, otherwise, we just re-asked the question as we remember it. There is no particular theme about the questions, we just found them interesting since, a) they are not trivial, b) they can be done relatively easy in InMemoryDatasets, c) our solution is more efficient than what we found in the original source.

General

julia> ds = Dataset([[1.66077, -1.05298, -0.499206, 2.47123, 2.45914, 1.14014],
                     [0.75, 0.75, 0.75, 0.75, 0.75, 0.75],
                     [0.709184, -2.53609, 0.0130659, -0.587867, 0.55786, 1.60398],
                     [0.333, 0.333, 0.333, 0.333, 0.333, 0.333],
                     [1.47438, 2.01485, 2.49006, 1.80345, 0.569928, 1.58403],
                     [1, 1, 1, 1, 1, 1],
                     [2.02678, 1.51587, 1.70535, 2.51628, 1.909, 0.794765],
                     [1.25, 1.25, 1.25, 1.25, 1.25, 1.25]],
                     ["a", "a_lim", "b", "b_lim", "c", "c_lim", "d", "d_lim"])
6×8 Dataset
 Row │ a          a_lim     b           b_lim     c         c_lim     d         d_lim    
     │ identity   identity  identity    identity  identity  identity  identity  identity
     │ Float64?   Float64?  Float64?    Float64?  Float64?  Int64?    Float64?  Float64?
─────┼───────────────────────────────────────────────────────────────────────────────────
   11.66077       0.75   0.709184      0.333  1.47438          1  2.02678       1.25
   2-1.05298       0.75  -2.53609       0.333  2.01485          1  1.51587       1.25
   3-0.499206      0.75   0.0130659     0.333  2.49006          1  1.70535       1.25
   42.47123       0.75  -0.587867      0.333  1.80345          1  2.51628       1.25
   52.45914       0.75   0.55786       0.333  0.569928         1  1.909         1.25
   61.14014       0.75   1.60398       0.333  1.58403          1  0.794765      1.25

julia> using Chain
julia> @chain ds begin
         compare(_[!, r"lim"], _[!, Not(r"lim")], cols = 1:4 .=> 1:4, eq = isless)
         byrow(count)
       end
6-element Vector{Int32}:
 4
 2
 2
 3
 3
 3

map! and map

  • How to randomly change about 10% of data values to missing?
julia> ds = Dataset(rand(10, 3), :auto)
10×3 Dataset
 Row │ x1         x2         x3        
     │ identity   identity   identity  
     │ Float64?   Float64?   Float64?  
─────┼─────────────────────────────────
   10.829492   0.266336   0.712512
   20.206569   0.252967   0.133839
   30.0773648  0.420897   0.363549
   40.404912   0.495679   0.400661
   50.412908   0.740628   0.785319
   60.624809   0.971097   0.725747
   70.19843    0.378382   0.0453478
   80.851221   0.563592   0.450065
   90.351243   0.0555308  0.142801
  100.208414   0.739952   0.926834

julia> map!(ds, x->rand()<.1 ? missing : x, :)
10×3 Dataset
 Row │ x1               x2               x3        
     │ identity         identity         identity  
     │ Float64?         Float64?         Float64?  
─────┼─────────────────────────────────────────────
   10.829492         0.266336   0.712512
   20.206569         0.252967   0.133839
   30.0773648        0.420897   0.363549
   40.404912         0.495679   0.400661
   5missing                0.740628   0.785319
   60.624809   missing          0.725747
   70.19843          0.378382   0.0453478
   8missing                0.563592   0.450065
   90.351243         0.0555308  0.142801
  100.208414         0.739952   0.926834

Row operations, byrow

  • In each row, how to replace missing values in a column by the first non-missing in previous columns. (Assuming for the first column the direction of search is reversed and all columns has the same type)
julia> ds = Dataset(rand([1,2,3, missing], 10, 6), :auto)
10×6 Dataset
 Row │ x1        x2        x3        x4        x5        x6       
     │ identity  identity  identity  identity  identity  identity
     │ Int64?    Int64?    Int64?    Int64?    Int64?    Int64?   
─────┼────────────────────────────────────────────────────────────
   11         1   missing         2   missing   missing
   22         1         3         3         2         3
   32         1   missing         3         3   missing
   41   missing   missing         1         1         1
   52         2   missing         1         3         1
   6missing   missing         2         1   missing         1
   7missing         2         3   missing         1         2
   83         3   missing         1         1         1
   93         1         1         3         1   missing
  10missing         1         3         1         1         3

julia> byrow(ds, fill!, :, with = byrow(ds, coalesce, :), rolling = true)
10×6 Dataset
 Row │ x1        x2        x3        x4        x5        x6       
     │ identity  identity  identity  identity  identity  identity
     │ Int64?    Int64?    Int64?    Int64?    Int64?    Int64?   
─────┼────────────────────────────────────────────────────────────
   11         1         1         2         2         2
   22         1         3         3         2         3
   32         1         1         3         3         3
   41         1         1         1         1         1
   52         2         2         1         3         1
   62         2         2         1         1         1
   72         2         3         3         1         2
   83         3         3         1         1         1
   93         1         1         3         1         1
  101         1         3         1         1         3
  • A use-case from practice : We have a data frame that has 10,000 rows and columns, but this time we have 50% of missing values randomly scattered in it. What we want to do is to fill missing values in each row with row means of non-missing values.
julia> ds = Dataset(rand([1.0, missing], 10_000, 10_000), :auto) .* (1:10_000);

julia> byrow(ds, fill!, :, with = byrow(ds, mean, :));
  • How to determine whether two sets of variables have a shared value : I have a data that contains two sets of variables, and I want to compare whether the two sets have the same value. Here we provide a scalable solution where the columns for the first set have "1" in their names, and the columns for the second set have "2" in their names.

since in the original post there are some specifications for checking equality of values we define a customised equality function

julia> a1 = Dataset(z1 = [1,missing,3,4,5], x1 = string.(3:7), z2 = [2,missing,4,5,6], x2 = [3,5,4,7,5])
5×4 Dataset
 Row │ z1        x1        z2        x2       
     │ identity  identity  identity  identity
     │ Int64?    String?   Int64?    Int64?   
─────┼────────────────────────────────────────
   11  3                2         3
   2missing  4          missing         5
   33  5                4         4
   44  6                5         7
   55  7                6         5

julia> eq(x, y) = isequal(x, y)
julia> eq(x::String, y) = isequal(parse(Int,x), y)
julia> eq(x, y::String) = isequal(x, parse(Int, y))
julia> eq(x::String, y::String) = isequal(parse(Int, x), parse(Int, y))
julia> eq(::Missing, ::Missing) = false
julia> a1.result = reduce((x,y) -> x .|= byrow(a1, in, r"2", item = y, eq = eq), names(a1, r"1"), init = zeros(Bool, nrow(a1)))
5-element Vector{Bool}:
 1
 0
 0
 0
 1
julia> using DLMReader, Chain
julia> ds = filereader(IOBuffer("""item,length,width,height,color,option_1,option_2,option_3
                        Box 1,2,4,6,,length,width,height
                        Tape,10,3,,clear,width,length,color
                        Pen,,,,red,color,,
                        """));

julia> @chain ds begin
         filter(2:4, by = !ismissing, type = any, view = true) # filter rows with dimensions
         modify!(Ref(2:4) .=> [byrow(select, with = 6), byrow(select, with = 7), byrow(select, with = 8)] .=> [:_t1, :_t2, :_t3],  # extract dimensions
                 r"_t" => byrow(join, delim = " × ") => :size, # create :size
                 :size => byrow(x->"size : " * replace(x, r" × $"=>"")) # remove orphans " × " at the end
                )
         filter(parent(_), 5, by = !ismissing, view = true) # filter rows with colours
         modify!(:color => byrow(x->"color : " * x) => :col) # create :col
         modify!(parent(_), [:size, :col]=>byrow(x->ismissing(x[1]) ? x[2] : ismissing(x[2]) ? x[1] : x[1] * ", " * x[2])=>:option_set) # join :size and :col
         select!(Not([:size, :col]), Not(r"_t")) # remove temp columns
      end
3×9 Dataset
 Row │ item      length    width     height    color     option_1  option_2  option_3  option_set                   
     │ identity  identity  identity  identity  identity  identity  identity  identity  identity                     
     │ String?   Int64?    Int64?    Int64?    String?   String?   String?   String?   String?                      
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Box 1            2         4         6  missing   length    width     height    size : 2 × 4 × 6
   2 │ Tape            10         3   missing  clear     width     length    color     size : 3 × 10, color : clear
   3 │ Pen        missing   missing   missing  red       color     missing   missing   color : red

Filtering

julia> ds = Dataset(rand(10, 5), :auto)
10×5 Dataset
 Row │ x1          x2         x3         x4         x5        
     │ identity    identity   identity   identity   identity  
     │ Float64?    Float64?   Float64?   Float64?   Float64?  
─────┼────────────────────────────────────────────────────────
   10.00924202  0.780238   0.270591   0.340518   0.815554
   20.0653996   0.427389   0.834984   0.142269   0.632195
   30.930175    0.0556611  0.70946    0.424178   0.389117
   40.368194    0.88762    0.145652   0.0967154  0.953927
   50.616674    0.202607   0.0228603  0.347016   0.597645
   60.293204    0.570458   0.440695   0.28341    0.948588
   70.46431     0.519242   0.360305   0.243189   0.133591
   80.663426    0.756596   0.46699    0.511906   0.0340278
   90.681305    0.23287    0.35492    0.754141   0.134459
  100.0739887   0.157783   0.697812   0.421743   0.229453

julia> filter(ds, 1:2, type = isless, with = :x5, rev = true)
3×5 Dataset
 Row │ x1        x2        x3        x4        x5        
     │ identity  identity  identity  identity  identity  
     │ Float64?  Float64?  Float64?  Float64?  Float64?  
─────┼───────────────────────────────────────────────────
   10.46431   0.519242  0.360305  0.243189  0.133591
   20.663426  0.756596  0.46699   0.511906  0.0340278
   30.681305  0.23287   0.35492   0.754141  0.134459

Grouping

julia> ds = Dataset(id = [1,1,1,1,1,2,2,2,3,3,3],
                    date = Date.(["2019-03-05", "2019-03-12", "2019-04-10",
                            "2019-04-29", "2019-05-10", "2019-03-20",
                            "2019-04-22", "2019-05-04", "2019-11-01",
                            "2019-11-10", "2019-12-12"]),
                    outcome = [false, false, false, true, false, false,
                               true, false, true, true, true])
11×3 Dataset
 Row │ id        date        outcome  
     │ identity  identity    identity
     │ Int64?    Date?       Bool?    
─────┼────────────────────────────────
   11  2019-03-05     false
   21  2019-03-12     false
   31  2019-04-10     false
   41  2019-04-29      true
   51  2019-05-10     false
   62  2019-03-20     false
   72  2019-04-22      true
   82  2019-05-04     false
   93  2019-11-01      true
  103  2019-11-10      true
  113  2019-12-12      true

julia> combine(gatherby(ds, [1, 3], isgathered = true),
                        (:) => last,
                        dropgroupcols = true)
7×3 Dataset
 Row │ last_id   last_date   last_outcome
     │ identity  identity    identity     
     │ Int64?    Date?       Bool?        
─────┼────────────────────────────────────
   11  2019-04-10         false
   21  2019-04-29          true
   31  2019-05-10         false
   42  2019-03-20         false
   52  2019-04-22          true
   62  2019-05-04         false
   73  2019-12-12          true

Joins

julia> store = Dataset([Date.(["2019-10-01", "2019-10-02", "2019-10-03", "2019-10-04",
                         "2019-10-05", "2019-10-01", "2019-10-02", "2019-10-03",
                         "2019-10-04", "2019-10-05"]),
                         ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]],
                         ["date", "store"])
julia> roster = Dataset([["A", "A", "A", "A", "B", "B", "B", "B"],
                         [1, 2, 3, 4, 5, 6, 7, 8],
                         [Date("2019-09-30"), Date("2019-10-02"), Date("2019-10-03"), Date("2019-10-04"),
                         Date("2019-09-30"), Date("2019-10-02"), Date("2019-10-03"), Date("2019-10-04")],
                         [Date("2019-10-04"), Date("2019-10-04"), Date("2019-10-05"), Date("2019-10-06"),
                         Date("2019-10-04"), Date("2019-10-04"), Date("2019-10-05"), Date("2019-10-06")]],
                         ["store", "employee_ID", "start_date", "end_date"])
julia> using Chain
julia> @chain store begin
          innerjoin(roster, on = [:store => :store, :date => (:start_date, :end_date)])
          groupby([:store, :date])
          combine(:employee_ID => length)
       end
10×3 Dataset
 Row │ store     date        length_employee_ID
     │ identity  identity    identity           
     │ String?   Date?       Int64?             
─────┼──────────────────────────────────────────
   1 │ A         2019-10-01                   1
   2 │ A         2019-10-02                   2
   3 │ A         2019-10-03                   3
   4 │ A         2019-10-04                   4
   5 │ A         2019-10-05                   2
   6 │ B         2019-10-01                   1
   7 │ B         2019-10-02                   2
   8 │ B         2019-10-03                   3
   9 │ B         2019-10-04                   4
  10 │ B         2019-10-05                   2

Reshape

julia> ds = Dataset("Country_Code"=>["FR","FR","FR","USA","USA","USA","BR","BR","BR"],
                    "Indicator_Name"=>["GPD","Pop","birth","GPD","Pop","birth","GPD","Pop","birth"],
                    "2005"=>[14,34,56, 25, 67, 68, 55, 8,99],
                    "2006"=>[23, 34, 34, 43,34,34, 65, 34,45])
9×4 Dataset
 Row │ Country_Code  Indicator_Name  2005      2006     
     │ identity      identity        identity  identity
     │ String?       String?         Int64?    Int64?   
─────┼──────────────────────────────────────────────────
   1 │ FR            GPD                   14        23
   2 │ FR            Pop                   34        34
   3 │ FR            birth                 56        34
   4 │ USA           GPD                   25        43
   5 │ USA           Pop                   67        34
   6 │ USA           birth                 68        34
   7 │ BR            GPD                   55        65
   8 │ BR            Pop                    8        34
   9 │ BR            birth                 99        45

julia> transpose(gatherby(ds, 1), 3:4, id = r"Name")
6×5 Dataset
 Row │ Country_Code  _variables_  GPD       Pop       birth    
     │ identity      identity     identity  identity  identity
     │ String?       String?      Int64?    Int64?    Int64?   
─────┼─────────────────────────────────────────────────────────
   1 │ FR            2005               14        34        56
   2 │ FR            2006               23        34        34
   3 │ USA           2005               25        67        68
   4 │ USA           2006               43        34        34
   5 │ BR            2005               55         8        99
   6 │ BR            2006               65        34        45

for loops

julia> using Chain
julia> ds = Dataset( [["Becks", "307NRR", "321NRR", "342NRR", "Campbell", "329NRR", "347NRR", "Crows", "C3001R"],
                     [missing, "R", "R", "R", missing, "R", "R", missing, "R"],
                     [missing, "CM,SG", "CM,SG", "CM,SG", missing, "None", "None", missing, "None"],
                     [missing, 3.0, 3.2, 3.4, missing, 3.2, 3.4, missing, 3.0]], :auto)
9×4 Dataset
 Row │ x1        x2        x3        x4        
     │ identity  identity  identity  identity  
     │ String?   String?   String?   Float64?  
─────┼─────────────────────────────────────────
   1 │ Becks     missing   missing   missing   
   2307NRR    R         CM,SG           3.0
   3321NRR    R         CM,SG           3.2
   4342NRR    R         CM,SG           3.4
   5 │ Campbell  missing   missing   missing   
   6329NRR    R         None            3.2
   7347NRR    R         None            3.4
   8 │ Crows     missing   missing   missing   
   9 │ C3001R    R         None            3.0

julia> function replace_with_prev(x,y)
           res = similar(x, length(x))
           for i in 1:length(x)
               if !ismissing(y[i])
                   res[i] = res[i-1]
               else
                   res[i] = x[i]
               end
           end
           res
       end
f1 (generic function with 2 methods)
julia> @chain ds begin
         modify!((1,2)=>replace_with_prev=>:name) # find previous name
         dropmissing!(2) # drop unwanted rows
         select!(:name, :) # rearrange columns
      end
6×5 Dataset
 Row │ name      x1        x2        x3        x4       
     │ identity  identity  identity  identity  identity
     │ String?   String?   String?   String?   Float64?
─────┼──────────────────────────────────────────────────
   1 │ Becks     307NRR    R         CM,SG          3.0
   2 │ Becks     321NRR    R         CM,SG          3.2
   3 │ Becks     342NRR    R         CM,SG          3.4
   4 │ Campbell  329NRR    R         None           3.2
   5 │ Campbell  347NRR    R         None           3.4
   6 │ Crows     C3001R    R         None           3.0