
Commit 2aa3490

Add solution to exercise 0006
1 parent 2bd4a78 commit 2aa3490

File tree

7 files changed

+308
-0
lines changed


burun/exercise_0006/Stop Words.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
a hence see able her seeing about here seem above hereafter seemed abroad hereby seeming according herein seems accordingly here's seen across hereupon self actually hers selves adj herself sensible after he's sent afterwards hi serious again him seriously against himself seven ago his several ahead hither shall ain't hopefully shan't all how she allow howbeit she'd allows however she'll almost hundred she's alone i should along i'd shouldn't alongside ie since already if six also ignored so although i'll some always i'm somebody am immediate someday amid in somehow amidst inasmuch someone among inc something amongst inc. sometime an indeed sometimes and indicate somewhat another indicated somewhere any indicates soon anybody inner sorry anyhow inside specified anyone insofar specify anything instead specifying anyway into still anyways inward sub anywhere is such apart isn't sup appear it sure appreciate it'd t appropriate it'll take are its taken aren't it's taking around itself tell as i've tends a's j th aside just than ask k a thank asking keep thanks associated keeps thanx at kept that available know that'll away known thats awfully knows that's b l that've back last the backward lately their backwards later theirs be latter them became latterly themselves because least then become less thence becomes lest there becoming let thereafter been let's thereby before like there'd beforehand liked therefore begin likely therein behind likewise there'll being little there're believe look theres below looking there's beside looks thereupon besides low there've best lower these better ltd they between m they'd beyond made they'll both mainly they're brief make they've but makes thing by many things c may think came maybe third can mayn't thirty cannot me this cant mean thorough can't meantime thoroughly caption meanwhile those cause merely though causes might three certain mightn't through certainly mine throughout changes minus thru clearly miss thus c'mon more till 
co moreover to co. most together com mostly too come mr took comes mrs toward concerning much towards consequently must tried consider mustn't tries considering my truly contain myself try containing n trying contains name t's corresponding namely twice could nd two couldn't near u course nearly un c's necessary under currently need underneath d needn't undoing dare needs unfortunately daren't neither unless definitely never unlike described neverf unlikely despite neverless until did nevertheless unto didn't new up different next upon directly nine upwards do ninety us does no use doesn't nobody used doing non useful done none uses don't nonetheless using down noone usually downwards no-one v during nor value e normally various each not versus edu nothing very eg notwithstanding via eight novel viz eighty now vs either nowhere w else o want elsewhere obviously wants end of was ending off wasn't enough often way entirely oh we especially ok we'd et okay welcome etc old well even on we'll ever once went evermore one were every ones we're everybody one's weren't everyone only we've everything onto what everywhere opposite whatever ex or what'll exactly other what's example others what've except otherwise when f ought whence fairly oughtn't whenever far our where farther ours whereafter few ourselves whereas fewer out whereby fifth outside wherein first over where's five overall whereupon followed own wherever following p whether follows particular which for particularly whichever forever past while former per whilst formerly perhaps whither forth placed who forward please who'd found plus whoever four possible whole from presumably who'll further probably whom furthermore provided whomever g provides who's get q whose gets que why getting quite will given qv willing gives r wish go rather with goes rd within going re without gone really wonder got reasonably won't gotten recent would greetings recently wouldn't h regarding x had regardless y hadn't regards yes half 
relatively yet happens respectively you hardly right you'd has round you'll hasn't s your have said you're haven't same yours having saw yourself he say yourselves he'd saying you've he'll says z hello second zero help secondly
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
Significance
Scientists’ productivity usually is measured with a single metric, such as number of articles published. Here, we study two dimensions of scientists’ knowledge contributions in 10-y publication records: their depth and their breadth. Study 1 shows that scientists view pursuing a deeper research project to be more attractive than pursuing a broader project; for example, scientists viewed broad projects as riskier and less important than deeper projects. Study 2 shows that scientists’ personal dispositions predict the aggregated depth vs. breadth of their published articles. Armed with such knowledge, scientists can strategically consider the desired nature of their research portfolios, criteria for choosing and designing research projects, how to compose research teams, and the inhibitors and facilitators of boundary-crossing research.
Scientific journal publications, and their contributions to knowledge, can be described by their depth (specialized, domain-specific knowledge extensions) and breadth (topical scope, including spanning multiple knowledge domains). Toward generating hypotheses about how scientists’ personal dispositions would uniquely predict deeper vs. broader contributions to the literature, we assumed that conducting broader studies is generally viewed as less attractive (e.g., riskier) than conducting deeper studies. Study 1 then supported our assumptions: the scientists surveyed considered a hypothetical broader study, compared with an otherwise-comparable deeper study, to be riskier, a less-significant opportunity, and of lower potential importance; they further reported being less likely to pursue it and, in a forced choice, most chose to work on the deeper study. In Study 2, questionnaire measures of medical researchers’ personal dispositions and 10 y of PubMed data indicating their publications’ topical coverage revealed how dispositions differentially predict depth vs. breadth. Competitiveness predicted depth positively, whereas conscientiousness predicted breadth negatively. Performance goal orientation predicted depth but not breadth, and learning goal orientation contrastingly predicted breadth but not depth. Openness to experience positively predicted both depth and breadth. Exploratory work behavior (the converse of applying and exploiting one’s current knowledge) predicted breadth positively and depth negatively. Thus, this research distinguishes depth and breadth of published knowledge contributions, and provides new insights into how scientists’ personal dispositions influence research processes and products.
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
However, let me indulge you for a moment and assume that somebody actually does care about the finding, possibly someone who is not a scientist. In the worst possible case, they could be a politician. By all that is sacred, someone should look into it then and find out what’s going on! But in order to do so, you need to have a good theory, or at least a viable alternative hypothesis, not the null. If you are convinced something isn’t true, show me why. It does not suffice to herald each direct non-replication as evidence that the original finding was a false positive, because in reality these kinds of discussions go like this:
“At that point you have an incorrect scientific record.”
Honestly, this statement summarizes just about everything that is wrong with the Crusade for True Science. The problem is not that there may be mistakes in the scientific record but the megalomaniac delusion that there is such a thing as a “correct” scientific record. Science is always wrong. It’s inherent to the process to be wrong and to gradually self-correct.
As I said above, the scientific record is full of false positives, because this is how it works. Fortunately, I think the vast majority of false positives in the record are completely benign. They will either be corrected or they will pass into oblivion. The false theories that I worry about are the ones that most sane scientists already reject anyway: creationism, climate change denial, the anti-vaccine movement, Susan Greenfield’s ideas about the modern world, or (to stay with present events) the notion that you can “walk off your Parkinson’s.” Ideas like these are extremely dangerous, and they have true potential to steer public policy in a very bad direction.
In contrast, I don’t really care very much whether priming somebody with the concept of a professor makes them perform better at answering trivia questions. I personally doubt it, and I suspect simpler explanations (including that it could be completely spurious), but the way to prove that is to show that the original result could not have occurred, not that you are incapable of reproducing it. If that sounds a lot more difficult than churning out one failed replication after another, that’s because it is!
“Also, saying ‘given sufficient time and resources science will self-correct’ is a statement that is very easy to use to wipe all problems with current day science under the rug: nothing to see here, move along, move along…”
Nothing is being swept under any rugs here. For one thing, I remain unconvinced by the so-called evidence that current day science has a massive problem. The Schoens and Stapels don’t count. There have always been scientific frauds and we really shouldn’t even be talking about the fraudsters. So, ahm, sorry for bringing them up.
The real issue that has all the Crusaders riled up so much is that the current situation apparently generates a far greater proportion of false positives than is necessary. There is a nugget of truth to this notion, but I think the anxiety is misplaced. I am all in favor of measures to reduce the rate of false positives through better statistical and experimental practices. More importantly, we should reward good science rather than sensational science.
This is why the Crusaders promote preregistration. However, I don’t think this is going to help; it will only ever cure the symptom, not the cause, of the problem. The underlying cause, the actual sickness that has infected modern science, is the misguided idea that hypothesis-driven research is somehow better than exploratory science. And sadly, this sickness plagues the Crusaders more than anyone. Instead of preregistration, which, despite all the protestations to the contrary, implicitly places greater value on “purely confirmatory research” than on exploratory science, what we should do is reward good exploration. Suppose we did that instead of insisting that grant proposals list clear hypotheses, “anticipating” results in our introduction sections, and harping on about preregistered methods; suppose we were also more honest about the fact that scientific findings and hypotheses are rarely, if ever, fully true, and did a better job of communicating this to the public. Then current day science probably wouldn’t have any of these problems.
“We scientists know what’s best, don’t you worry your pretty little head about it…”
Who’s saying this? The whole point I have been arguing is that scientists don’t know what’s best. What I find so exhilarating about being a scientist is that this is a profession, quite possibly the only profession, in which you can be completely honest about the fact that you don’t really know anything. We are not in the business of knowing but of asking better questions.
Please do worry your pretty little head! That’s another great thing about being a scientist. We don’t live in ivory towers. Given the opportunity, anyone can be a scientist. I might take your opinion on quantum mechanics more seriously if you have the education and expertise to back it up, but in the end that is a prior. A spark of genius can come from anywhere.
What should we do?
If you have doubts about some reported finding, go and ask questions about it. Think about alternative, simpler explanations for it. Design and conduct experiments to test these explanations. Then report your results to the world and discuss the merits and flaws of your studies. Refine your ideas and designs, and repeat the process over and over. In the end there will be a body of evidence. It will either convince you that your doubt was right or it won’t. More importantly, it may also be seen by many others, and they can form their own opinions. They might come up with their own theories and with experiments to test them.
Doesn’t this sound like a perfect solution to our problems? If only there were a name for this process…
Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
Let's discuss the method Ashenfelter used to build his model: linear regression. We'll start with one-variable linear regression, which uses just one independent variable to predict the dependent variable.

This figure shows a plot of one of the independent variables, average growing season temperature (AGST), and the dependent variable, wine price. The goal of linear regression is to create a predictive line through the data.

There are many different lines that could be drawn to predict wine price using average growing season temperature. A simple option would be a flat line at the average price, in this case 7.07. The equation for this line is y = 7.07, so this linear regression model would predict 7.07 regardless of the temperature. But it looks like a better line would have a positive slope, such as this line in blue. The equation for this line is y = 0.5*(AGST) - 1.25, so this linear regression model would predict a higher price when the temperature is higher.
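The two candidate lines above can be sketched as simple prediction functions. This is a sketch in Python, not the course's own software; the coefficients come from the lecture's figure, and the example temperature is an illustrative value, not a data point from the wine data set.

```python
def predict_baseline(agst):
    # Flat line at the average price: y = 7.07, ignoring temperature.
    return 7.07

def predict_sloped(agst):
    # Blue line from the figure: y = 0.5 * AGST - 1.25.
    return 0.5 * agst - 1.25

# Illustrative growing season temperature (not from the actual data).
print(predict_baseline(17.0))  # 7.07, whatever the temperature
print(predict_sloped(17.0))    # 0.5 * 17 - 1.25 = 7.25
```

Note how the sloped line's prediction rises with temperature while the baseline never changes, which is exactly the contrast the transcript describes.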
Let's make this idea a little more formal. In general form, a one-variable linear regression model is a linear equation to predict the dependent variable, y, using the independent variable, x. Beta 0 is the intercept term, or intercept coefficient, and Beta 1 is the slope of the line, or the coefficient for the independent variable, x.

For each observation, i, we have data for the dependent variable, Yi, and data for the independent variable, Xi. Using this equation, we make a prediction, Beta 0 plus Beta 1 times Xi, for each data point, i. This prediction is hopefully close to the true outcome, Yi. But since the coefficients have to be the same for all data points, i, we often make a small error, which we'll call epsilon i. This error term is also often called a residual. Our errors will only all be 0 if all our points lie perfectly on the same line. This rarely happens, so we know that our model will probably make some errors.
The best model, or best choice of coefficients Beta 0 and Beta 1, has the smallest error terms, or smallest residuals.

This figure shows the blue line that we drew in the beginning. We can compute the residuals, or errors, of this line for each data point. For example, for this point the actual value is about 6.2, and using our regression model we predict about 6.5. So the error for this data point is negative 0.3, which is the actual value minus our prediction. As another example, for this point the actual value is about 8, and using our regression model we predict about 7.5. So the error for this data point is about 0.5, again the actual value minus our prediction.
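The two example residuals work out as follows. The actual and predicted values here are the approximate readings from the lecture's figure, so this is an illustrative check, not the exact data:

```python
def residual(actual, predicted):
    # A residual is the actual value minus the model's prediction.
    return actual - predicted

# Approximate values read off the figure.
print(residual(6.2, 6.5))  # about -0.3 (model over-predicts)
print(residual(8.0, 7.5))  # 0.5 (model under-predicts)
```

The sign tells you the direction of the miss: negative when the line sits above the point, positive when it sits below.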
One measure of the quality of a regression line is the sum of squared errors, or SSE. This is the sum of the squared residuals, or error terms. Let n equal the number of data points that we have in our data set. Then the sum of squared errors is equal to the error we make on the first data point, squared, plus the error we make on the second data point, squared, and so on for all data points, up to the n-th data point, squared.
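In code, SSE is just the sum of the squared residuals over all n points. The tiny data set below is invented for illustration; the course's wine data is not reproduced here:

```python
def sum_of_squared_errors(actuals, predictions):
    # SSE = sum over all n points of (actual - predicted)^2.
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions))

# Illustrative data only: the first two residuals match the figure's
# examples (-0.3 and 0.5), the third point is predicted exactly.
actuals = [6.2, 8.0, 7.0]
predictions = [6.5, 7.5, 7.0]
print(sum_of_squared_errors(actuals, predictions))  # about 0.34
```

Squaring keeps the positive and negative residuals from canceling each other out, so a line that misses badly in both directions still gets a large SSE.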
We can compute the sum of squared errors for both the red line and the blue line. As expected, the blue line is a better fit than the red line, since it has a smaller sum of squared errors.
The line that gives the minimum sum of squared errors is shown in green. This is the line that our regression model will find.

Although the sum of squared errors allows us to compare lines on the same data set, it's hard to interpret, for two reasons. The first is that it scales with n, the number of data points. If we built the same model with twice as much data, the sum of squared errors might be twice as big, but this doesn't mean it's a worse model. The second is that the units are hard to understand: the sum of squared errors is in squared units of the dependent variable. Because of these problems, Root Mean Squared Error, or RMSE, is often used. This divides the sum of squared errors by n and then takes a square root, so it's normalized by n and is in the same units as the dependent variable.
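RMSE as just described, i.e. the square root of SSE divided by n, again on invented illustrative data:

```python
import math

def rmse(actuals, predictions):
    # RMSE = sqrt(SSE / n): normalized by the number of points and
    # expressed in the same units as the dependent variable.
    n = len(actuals)
    sse = sum((a - p) ** 2 for a, p in zip(actuals, predictions))
    return math.sqrt(sse / n)

# Same illustrative data as before, not the wine data.
print(rmse([6.2, 8.0, 7.0], [6.5, 7.5, 7.0]))  # sqrt(0.34 / 3), about 0.34
```

Because of the division by n, doubling the data set with equally good predictions leaves RMSE roughly unchanged, which fixes the first interpretability problem; the square root fixes the second by restoring the original units.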
Another common error measure for linear regression is R squared. This error measure is nice because it compares the best model to a baseline model: the model that does not use any variables, or the red line from before. The baseline model predicts the average value of the dependent variable regardless of the value of the independent variable.

We can compute that the sum of squared errors for the best fit line, or the green line, is 5.73, and the sum of squared errors for the baseline, or the red line, is 10.15. The sum of squared errors for the baseline model is also known as the total sum of squares, commonly referred to as SST. Then the formula for R squared is: R squared equals 1 minus the sum of squared errors divided by the total sum of squares. In this case it equals 1 minus 5.73 divided by 10.15, which equals 0.44. R squared is nice because it captures the value added from using a linear regression model over just predicting the average outcome for every data point.
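Plugging in the SSE and SST values given in the lecture:

```python
def r_squared(sse, sst):
    # R^2 = 1 - SSE / SST: the fraction of baseline error eliminated
    # by the regression model.
    return 1 - sse / sst

# From the lecture: green line SSE = 5.73, baseline SST = 10.15.
print(round(r_squared(5.73, 10.15), 2))  # 0.44
```

The edge cases match the discussion that follows: if the model is no better than the baseline (SSE equals SST), R squared is 0, and if the model makes no errors (SSE is 0), R squared is 1.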
So what values do we expect to see for R squared? Well, both the sum of squared errors and the total sum of squares have to be greater than or equal to zero, because they're sums of squared terms, so they can't be negative. Additionally, the sum of squared errors has to be less than or equal to the total sum of squares. This is because our linear regression model could just set the coefficient for the independent variable to 0, and then we would have the baseline model. So our linear regression model will never be worse than the baseline model.

In the worst case, the sum of squared errors equals the total sum of squares, and our R squared is equal to 0, meaning no improvement over the baseline. In the best case, our linear regression model makes no errors, and the sum of squared errors is equal to 0; then our R squared is equal to 1. So an R squared equal to 1, or close to 1, means a perfect or almost perfect predictive model.

R squared is nice because it's unitless and therefore universally interpretable between problems. However, it can still be hard to compare between problems: good models for easy problems will have an R squared close to 1, but good models for hard problems can still have an R squared close to zero. Throughout this course we will see examples of both types of problems.
