loading my own datasets #3808
The documentation of sklearn is really very useful and should answer your question: http://scikit-learn.org (basically you have to put your data in numpy arrays) |
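As a minimal sketch of what "put your data in numpy arrays" can mean for a purely numeric CSV: the column layout below (features first, target last) and the inline text standing in for a file are assumptions made for illustration, not something stated in this thread.

```python
import io

import numpy as np

# Stand-in for a CSV file on disk; with a real file, pass its path
# to np.loadtxt instead of this in-memory buffer.
csv_text = "f1,f2,target\n5.1,3.5,0\n4.9,3.0,0\n6.2,2.9,1\n"

# skiprows=1 skips the header line; every remaining cell must be numeric.
table = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

X = table[:, :-1]             # features, shape (n_samples, n_features)
y = table[:, -1].astype(int)  # target, shape (n_samples,)
```

Estimators generally take `X` as a 2D array and `y` as a 1D array of the same length.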
This is something that could have a bit more documentation than is in there On 29 October 2014 09:45, Alexander Fabisch notifications@github.com
|
@jnothman Should we reopen this issue and add a new section in the documentation? For example in this section: http://scikit-learn.org/stable/tutorial/basic/tutorial.html ("Loading your own data"). |
see #2801 On 29 October 2014 18:07, Alexander Fabisch notifications@github.com
|
My own dataset means the dataset that I have collected myself, not the I have a simple csv file and I on my desktop and I want to load it inside I need a very simple and easy way to do so. I would highly appreciate any useful advice. On 29 October 2014 15:25, jnothman notifications@github.com wrote:
|
See http://pandas.pydata.org/pandas-docs/stable/io.html On 29 October 2014 21:15, MartinLion notifications@github.com wrote:
|
Thanks for the link. I checked it out, but the process looks complicated. On 29 October 2014 19:12, jnothman notifications@github.com wrote:
|
It probably looks something like: import pandas as pd Then fit a scikit-learn estimator: SVC().fit(data, target) On 29 October 2014 23:19, MartinLion notifications@github.com wrote:
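The code in the comment above did not survive in full, so the following is only a hedged sketch of the general pattern it describes (read a CSV with pandas, then fit an estimator), not the original snippet. The column names, the inline CSV text, and the choice of which column is the target are all hypothetical.

```python
import io

import pandas as pd
from sklearn.svm import SVC

# Stand-in for a CSV file on disk; with a real file, pass its path
# to pd.read_csv instead of this in-memory buffer.
csv_text = """height,weight,label
1.0,2.0,a
1.2,1.9,a
3.1,4.0,b
2.9,4.2,b
"""

frame = pd.read_csv(io.StringIO(csv_text))

# Hypothetical layout: "label" holds the class, the other columns are features.
target = frame["label"].to_numpy()
data = frame.drop(columns="label").to_numpy()

# Fit a scikit-learn estimator on the arrays, then predict on the same rows.
model = SVC().fit(data, target)
predictions = model.predict(data)
```

With a file on the desktop, the only change would be the argument to `pd.read_csv`.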
|
You could also have a look at |
Hi jnothman, I tried applying your code. As soon as I entered (import pandas as pd), I got the following message in red: import pandas as pd What should I do? |
It just means you do not have the dateutil module installed. You can install it by doing
hth |
You can have a look at this for more details, http://stackoverflow.com/questions/20853474/importerror-no-module-named-dateutil-parser |
Thanks MechCoder for your contribution. I tried "sudo apt-get install python-dateutil", but it is not clear to me at what stage I should enter this command. Is there any YouTube tutorial about loading a dataset (not iris, which is everywhere, or other famous stuff)? Video is easier than links. |
Hi all, import pandas as pd and I got this error: could you please guide me? |
We could tell you what the problem is but I think in this case you will learn more from it if you find it on your own. You should read the error message carefully. It is a Python syntax error.
|
On 22 Aug 2015 08:33, "samira afzal" notifications@github.com wrote:
I recommend you finding another tool where you can work with easily without Good luck
|
To be clear, these previous posters are saying that being somewhat On 22 August 2015 at 18:04, MartinLion notifications@github.com wrote:
|
Just want to support @MartinLion --- I am a scikit-learn newbie and have just spent a frustrating time going through the docs, and I can't find anywhere how to read my own data (and not a prepared toy dataset), and what the Python format of the data is. |
• How do I load my data to work with scikit-learn?
• How to load data from CSV file?
We should add these in the FAQ.
|
should we instead add as a section in the tutorial below/above "Loading an example dataset"? |
Also could you tag this "Question", "Documentation" and reopen it? |
We should reference it. But I don't see this as tutorial material because That's an answer that the users really don't want to hear, because there I guess that we should have a sentence like this in the tutorial, where As a side note, the kind of errors hit by the users on the thread of this |
Well, take it easy!!! I don't know whether you are one of scikit-learn staff or not, but I need First reason, criticizing people (like what you did) and assuming that they Second, we can understand from your unsuitable way of talking that you On the other hand, regarding the question "should we instead add as a Nevertheless, for those who are still struggling with scikit-learn, I would Thanks David van Leeuwen for your support. Good luck in your analysis. Cheers,
|
Hey Martin, Kindly don't be offended. He did not criticize :) He, being one of the top contributors to scikit-learn, has to make tough decisions as to what will go into our codebase and what will not, as a more verbose documentation or tutorial might not be preferable for a lot of people. Gael has in fact contributed a lot of user guides himself to scikit-learn to help users. The reason why he was opposing that addition to the tutorial was that there is a multitude of ways in which users have their data stored, and such a user guide on how to get the input data from all of them (a text file/csv file/database/zipped archive) is indeed out of scope for scikit-learn, which is a machine learning library. The most important thing to note here is that it is very clearly explained by the documentation of libraries which handle data, like numpy or pandas. It is expected from the user that he or she knows this! Since it seems to not be very clear, he suggests that we add a FAQ, pointing the user to such user guides, which are more elaborate than we could possibly get :) It may appear that our tutorial could be a bit more elaborate on how the inputs are obtained. But the thing, in general, with user guides is that they could always be a little bit more elaborate, which makes us set a hard limit on how detailed our user guides can get, to help contain the user guide in a maintainable format :) If you think from that perspective, you yourself would understand our situation. As this issue is open, someone will indeed send a PR soon adding a nice FAQ entry and an example, maybe, which could help clarify your (or any other new user's) doubts on input formats. Cheers! |
Hello @MartinLion , we understand your eagerness to solve your problem, and your frustration when it is not solved. However, you seem quite misinformed about what is scikit-learn, how it works, and how the project is developed. Therefore, I would like to make some points clear for you. As you can see from What I would like to emphasize is that there is no such thing as a scikit-learn "product", or scikit-learn "staff" (only a handful of people have worked full time on the project). You mention "we as users are Also, although users' needs are indeed a top priority of scikit-learn (it has an amazing documentation, of which most scientific Python packages can be jealous!), each software addresses a well-targeted niche of users, and it is just normal that scikit-learn cannot fit all users. For example, it is preferable to use scikit-learn with already a good knowledge of Scientific Python. So, I'm really glad that you found a So, folks, let's all show some good will and keep a constructive dialog. |
Maybe we should make clear that scikit-learn is a Python library. It does not have the same scope as WEKA or RapidMiner. It fits perfectly into the scientific Python ecosystem but you should be willing to write code if you want to use it. |
Perhaps I should elaborate on my original frustration, to give you some context. I've been programming in Python almost exclusively for a year now (I am a late convert), and am fairly familiar with the ecosystem---I've done lots of webservice-related things, but also manipulation of resources related to automatic speech recognition. I have done my scientific work in Julia for a couple of years, and before that, in R, octave, c++/c (some 30 years in total). The Julia ecosystem is quite dynamic, and it is all very exciting, but Python just has this very large ecosystem and very clean coding, which makes it very attractive to use for little side experiments. This time I had to do some topic classification of (single sentence) text documents. Now there is an abundant choice of language technology tools in Python, and I believe that via lda I got to scikit-learn. Great tutorials, lovely datasets and all, but I found it very difficult to find out how to organize my own data so that I could load it in. Just now, I browsed through the user guide again to find the docs for "load_files", but I couldn't find an entry. So a google search for "sklearn.datasets.load_files" got me there just now, and I happened to remember the particular module path from more painstaking searches yesterday (it is mentioned somewhere in a tutorial). For me, the essential information would have been: "Organize your data one document per file, one directory per class"---more or less what's under the documentation for load_files. This all makes perfect sense, but I come from a community where the usual formats are "one datapoint per line", often with the class label on that line. But having said all this, I am pretty impressed by how the Python (text) community has standardized data representation, from what I've seen so far. But perhaps because of the widely used standard data representation, this aspect naturally gets less attention in documentation.
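The "one document per file, one directory per class" layout can be sketched roughly as follows. This is a minimal illustration, assuming a throwaway corpus written to a temporary directory; the class names and sentences are made up, and a real corpus would already be on disk.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Made-up corpus: keys become class directories, values become documents.
corpus = {
    "sports": ["the team won the match", "a late goal decided the game"],
    "weather": ["rain is expected tomorrow", "a cold front moves in tonight"],
}

with tempfile.TemporaryDirectory() as root:
    # Write one directory per class and one file per document.
    for label, sentences in corpus.items():
        class_dir = Path(root) / label
        class_dir.mkdir()
        for i, sentence in enumerate(sentences):
            (class_dir / f"doc{i}.txt").write_text(sentence, encoding="utf-8")

    # Passing an encoding makes load_files decode the documents to str
    # instead of returning raw bytes.
    bunch = load_files(root, encoding="utf-8", shuffle=False)

documents = bunch.data            # list of document strings
labels = bunch.target             # integer class index per document
class_names = bunch.target_names  # directory names, one per class
```

Data stored as "one datapoint per line" would first have to be split into such files, or loaded by other means (for example with pandas) and passed to a text vectorizer directly.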
As a final note, whenever I try to teach students how to use some scientific tool set or another, I have to spend quite some time on "how to import your data". Nobody likes to do it, it can be a lot of effort for what you potentially use only once, and is therefore always a difficult threshold. |
I agree that loading data is a difficult and important thing. However, it is a domain-specific problem. You have a particular type of data. I have another. My data is medical images of brain activity. I can tell you how I organize my data and load them. I can even tell you that we have written a whole package about this, with its own documentation. But that will probably not help you. What you want is something that tells you how to organize and load your data. Now, it may be that your data is something fairly classic, that many people have; for instance tabular data most often stored in CSV files. In which case there is a need for a package doing this data loading. I don't believe that it should be in scikit-learn. It needs to be in a package that is specialized for this data. For instance, we are not going to put image processing in scikit-learn. For tabular data, the dedicated package is pandas; as I mentioned in my reply we need to point to it. We, the scikit-learn team, want to make plugging pandas into scikit-learn easier. But it is not as easy as it may seem and it takes time (one of our core devs is prototyping something). I realize rereading your post that your data is most likely text documents. So my two examples of data (medical images and tabular data) were both wrong :$. Maybe the documentation on processing text with scikit-learn could indeed be improved and touch a bit on data organization. I don't know, I very seldom process text. But if you want to add a few words on this, you are most welcome to do a pull request. Anyhow, this illustrates my point about the diversity of the data: this whole thread is mostly about loading CSV files, as can be seen from earlier comments (before the thread exploded into a rant). The important thing is not the "CSV", which is the container, but the data model that underlies a CSV file. This data model is that of columns of data of different natures. It's a very different data model from that of processing text documents.
And finally, you are unhappy that teaching people "how to import your data" is time consuming. I don't think that there is an easy fix for this, even in a specific domain. The reason being that data meaning (i.e. data semantics) is still very much an open area. It's intrinsically hard to describe what the data means and how it's organized. You can try a simple experiment: grab a dataset from someone you don't know, about an experiment you don't know, and try understanding it. Not even loading it, just understanding it. I am sure that it will take time. What takes a human time tends to be very difficult for a computer. |
Hm I don't think we added pointers to the FAQ yet. It's certainly a FAQ. |
I wrote a short tutorial on how to get the dataset from a text format to a pandas DataFrame for use by sklearn here http://cis399-he.tumblr.com/post/151024047044/load-your-data-to-scikit-learn |
fixed in #7516 |
Well you could add some information about common errors that happen when loading own data. For example I have bumped on
I'd recommend adding a section summarizing assumptions about the data; for example, it was sheer luck that I found the information that data sets may not contain NaNs, nulls, etc. Also, what the parameters of a numpy array should be and what a pandas DataFrame should look like. |
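A rough sketch of checking those assumptions before fitting might look like the following. The table is made up, and dropping incomplete rows is only one possible way to handle missing values (imputation is another); many scikit-learn estimators raise an error when given NaN.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing value, standing in for a loaded CSV.
frame = pd.DataFrame(
    {"x1": [1.0, 2.0, np.nan, 4.0], "x2": [0.5, 0.1, 0.3, 0.9], "y": [0, 1, 0, 1]}
)

# Count missing values per column before handing anything to an estimator.
missing_per_column = frame.isna().sum()

# Simplest fix: drop incomplete rows.
clean = frame.dropna()

X = clean[["x1", "x2"]].to_numpy(dtype=float)  # 2D: (n_samples, n_features)
y = clean["y"].to_numpy()                      # 1D: (n_samples,)

# The basic shape and content checks most estimators rely on.
shapes_ok = X.ndim == 2 and y.ndim == 1 and X.shape[0] == y.shape[0]
all_finite = bool(np.isfinite(X).all())
```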
Perhaps that error message could be clearer, but I think passing regression targets to a classifier (as in your #7801) is a usage error that has nothing to do with loading your own dataset. |
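One way to see the distinction between classification and regression targets is scikit-learn's `type_of_target` helper. This is a small sketch with made-up arrays, not a reproduction of the case in #7801.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

class_labels = np.array([0, 1, 1, 0])             # discrete classes
real_values = np.array([0.37, 1.52, 2.08, 0.91])  # continuous measurements

# A classifier expects a discrete target such as "binary" or "multiclass";
# a "continuous" target belongs with a regressor instead.
label_kind = type_of_target(class_labels)
value_kind = type_of_target(real_values)
```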
@MartinLion @MartinLion And Thank you all developers, @MartinLion has spirit of learning, I praise that. |
Thanks for the kind words. Please let me know if you still have any issues with your data or learning Regards, On 20 November 2016 at 12:47, tursunwali notifications@github.com wrote:
|
Hi all,
I am very new to scikit-learn.
My question is: how do I load my own dataset (csv file)?
I would highly appreciate any answers.
Thanks.
Martin