Who's up for a challenge to win a billion?

Discussion in 'The Thunderdome' started by CardinalVol, Jan 21, 2014.

  1. fl0at_

    fl0at_ Humorless, asinine, joyless pr*ck

    Yea. Like I said. Not groundbreaking.

    For whatever reason, 9s win at almost a 70% rate in the East. But not sure if statistically significant.
     
  2. kidbourbon

    kidbourbon Well-Known Member

    I'd be interested in looking at the Vegas future odds going into the tournament vs. results. It would take some time to do, though. From the future odds, you could assign not just a ranking to each team but a score, which is more useful than ranking because it indicates not just that #1 is better than #2, but also how much better.

    With this data, you could:
    (1) assign a likelihood of winning for each game, and also for each team advancing through each round.
    (2) Compare that with actual results.
    (3)???
    (4) Profit?


    I know there are computer rankings that provide better predictive value than the seedings (it's well documented). And I'm confident that Vegas future odds would provide better predictive value than the seedings. But I wonder how Vegas futures stack up against some of the best algorithmic ranking systems (kenpom, LRMC, Sagarin, etc.)?

    But I'm not doing the research, so I'll have to keep wondering.
     
  3. kidbourbon

    kidbourbon Well-Known Member

    I'm confident a larger sample size would smooth out the 9 seed thing and also the 5 and 6 seed percentages.
     
  4. kidbourbon

    kidbourbon Well-Known Member

    Numbers
     
  5. fl0at_

    fl0at_ Humorless, asinine, joyless pr*ck

    I agree. Wonder if it is even worth looking at all the previous years, just for shits and giggles.

    But the 8 and 9 seed should be almost 50/50 anyway, so an overall 54/46 difference doesn't seem that significant to me.
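
    For what it's worth, whether a split like that is significant depends on how many games are behind it. Here is a rough sketch of an exact two-sided binomial test against a coin flip, in Python. The function name is made up for illustration, and the win and game counts have to come from the actual results; none are assumed here.

```python
from math import comb


def two_sided_binomial_p(wins, games, p=0.5):
    """Exact two-sided binomial p-value.

    Adds up the probability of every outcome that is no more likely
    than the observed one, assuming each game is won with probability p.
    """
    pmf = [comb(games, k) * p**k * (1 - p) ** (games - k) for k in range(games + 1)]
    observed = pmf[wins]
    # Small tolerance so ties are not lost to floating-point noise.
    total = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, total)
```

    Call it as two_sided_binomial_p(wins, games) with the real tallies. A result well above 0.05 would mean the split is consistent with a true 50/50 matchup.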
     
  6. fl0at_

    fl0at_ Humorless, asinine, joyless pr*ck

    Drop me a link to the future odds, and if I can find a way to easily extract whatever numbers, we can throw them in a spreadsheet and see.
     
  7. dknash

    dknash Chieftain

    It seems like 5 seeds are generally flawed major conference squads and 12 seeds are generally sound mid-majors. Maybe less so with the expansion of the field. I don't think it's a total fluke.
     
  8. kidbourbon

    kidbourbon Well-Known Member

    I'm gonna have to keep looking. This site here is the best for archived point spreads: http://www.goldsheet.com/gs_new/histcbb.php

    But I'm not finding tournament odds.
     
  9. kidbourbon

    kidbourbon Well-Known Member

    Boom!

    Found it. This site is gold. http://www.sportsoddshistory.com/aa_php/main.php?y=2009-2010&s=cbb&a=nc&o=r


    You can sort by column, so you'd wanna sort by the "odds prior to round 1" column.

    It only has the past four years, but it's a start.

    To convert plus-money moneyline odds (which is what futures almost always are) to implied winning percentage, do: 100/(ML + 100). So let's say Syracuse was +800. 100/(800 + 100) = .11 = 11%.

    But what you'll find is that when you add up all those percentages, it will be well above 100%. This is because of the juice that's built into the listed odds. There's probably an easy way to strip out the juice so that the numbers add up to give you 100%.

    In fact, I think you would just do this: X*(sum of all the percentages (listed of course as decimals)) = 1.
    Solve for X.
    Multiply each decimal by X.

    I think that would do the trick. The result is "true odds"...or whatever the name is for odds that actually represent likelihood of an outcome without the addition of juice.
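
    The conversion and the "solve for X" step above can be sketched in a few lines of Python. The function names are made up for illustration. The branch for negative moneylines is not from this post; it is the standard formula for favorites. Scaling everything by the same X is only one of several ways to remove the juice, so treat the output as an approximation.

```python
def implied_probability(moneyline):
    """Implied win probability from an American moneyline, juice included."""
    if moneyline > 0:
        return 100 / (moneyline + 100)
    # Favorites (negative moneylines): not covered in the post above.
    return -moneyline / (-moneyline + 100)


def strip_juice(moneylines):
    """Rescale implied probabilities so they sum to 1.

    moneylines: dict mapping team name -> American moneyline.
    Equivalent to solving X * sum(probabilities) = 1 and multiplying
    each probability by X.
    """
    raw = {team: implied_probability(ml) for team, ml in moneylines.items()}
    total = sum(raw.values())
    return {team: p / total for team, p in raw.items()}
```

    Feed strip_juice a dict built from the "odds prior to round 1" column for the whole field. The values that come back are the juice-free percentages, as decimals.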
     
    Last edited: Jan 30, 2014
  10. kidbourbon

    kidbourbon Well-Known Member


    Eh, it's probably a fluke. I doubt the committee has a label next to the five seed slot indicating "insert flawed major conference team".
     
  11. kidbourbon

    kidbourbon Well-Known Member

    How good are your web scraping skills? I seriously need to block off a weekend and learn how to do that. That's just a really useful thing to be able to do.

    tennisabstract.com is an absolute gold mine for tennis stats. It's amazing. The guy that does it, Jeff Sackmann, just keeps adding on new shiz too. It's a great resource, but I'd love to be able to pull the data off it to use to create my own tennis rankings based on margin of victory and schedule strength.

    So for example, this page here is a list of every match Rafael Nadal has played over the last 52 weeks. http://www.tennisabstract.com/cgi-bin/player.cgi?p=RafaelNadal&f=o1

    What I would like to be able to do is compile a database (a spreadsheet would do the trick) that has, say, the top 50 players, and for each player it has their last [X] number of opponents, or their opponents through a given period of time, and the "dominance ratio" for each match. The dominance ratio is the % of return points won / % of serve points lost, but it doesn't even need to be calculated because he's already calculated it for each match (under the DR heading in the above link).

    From there, I would create a fictional player whose stats all correspond to the tour averages. My general framework would then be to calculate the result if each actual player played this fictional player, and then for each match played over, say, the past year by each of the top 50 players, I would compare the actual result with the result of the fictional match.

    So basically play the top 50 against the fictional guy in a computer simulated match (this is easy to do...you just plug in serving percentage data and returning percentage data and the computer spits out a result). Rank the top 50 based on these results. But this is just a "preliminary" ranking that gives us a jumping off point from a SOS perspective.
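
    One building block for that kind of simulated match has a closed form: if the server wins each point independently with probability p, the chance of holding serve can be written down directly. Here is a sketch in Python, assuming independent points and standard scoring with deuce. hold_probability is a name made up here, and a full match simulator would still need to chain games into sets.

```python
def hold_probability(p):
    """Probability the server wins a game, given point-win probability p.

    Assumes every point is independent and standard game scoring.
    """
    q = 1 - p
    if p in (0.0, 1.0):
        return p
    # Win to love, to 15, or to 30 (opponent takes 0, 1 or 2 points).
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)
    # Reach deuce (3 points each), then win two points in a row
    # before the opponent does.
    from_deuce = 20 * p**3 * q**3 * (p**2 / (1 - 2 * p * q))
    return before_deuce + from_deuce
```

    A server who wins exactly half the points holds exactly half the time, and small edges per point turn into much bigger edges per game.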

    And then for every real match:
    Isner is [X]% better than average player
    Djokovic beat Isner with an [X] dominance ratio
    Use the last two lines to come up with a single number that is basically "the data from this single match suggests Djokovic is [X]% better than the average player."

    And then iterate accordingly through each player and each match. The result is a metric that ranks the players based on how good their opponents were, and what essentially amounts to their margin of victory over the opponent. And because tennis is such a "connected" sport (i.e., most players have played each other several times...so if you pick two of the top 50 players at random, and then pick a third player at random, odds are both of the first two guys will have played the third guy, and multiple times to boot), I think the end result would be a pretty damn accurate little metric. And way way way way way better than ATP rankings.
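
    One possible reading of that iteration, sketched in Python. Everything specific here is an assumption rather than a finished system: ratings live on a log scale (0 = average player), each match says "you are as good as your opponent's rating plus log(DR)", a player's rating is the average of those per-match numbers, and the loop repeats until things settle. It also assumes the loser's DR is the reciprocal of the winner's, which follows from the definition above if one player's serve points lost are the other's return points won. rate_players, damping and iterations are made-up names.

```python
import math
from collections import defaultdict


def rate_players(matches, iterations=200, damping=0.5):
    """Iteratively rate players from dominance ratios.

    matches: iterable of (player, opponent, dominance_ratio), where
    dominance_ratio is the first player's DR in that match.
    Returns a dict of log-scale ratings centered on 0.
    """
    # Record each match from both sides; the opponent's edge is the
    # negative of the player's (reciprocal DR on a log scale).
    results = defaultdict(list)
    for player, opponent, dr in matches:
        edge = math.log(dr)
        results[player].append((opponent, edge))
        results[opponent].append((player, -edge))

    ratings = {p: 0.0 for p in results}
    for _ in range(iterations):
        updated = {}
        for p, played in results.items():
            # Average of "opponent rating + margin" over all matches.
            target = sum(ratings[o] + edge for o, edge in played) / len(played)
            # Damping keeps the update from oscillating.
            updated[p] = (1 - damping) * ratings[p] + damping * target
        # Re-center so the average player stays at 0.
        mean = sum(updated.values()) / len(updated)
        ratings = {p: r - mean for p, r in updated.items()}
    return ratings
```

    math.exp(rating) turns a rating back into a "times better than average" ratio. Weighting recent matches more heavily, or seeding the starting ratings from the simulated matches against the fictional average player, would be natural next steps.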

    And, honestly, doing the coding -- via excel or otherwise -- for the ranking system wouldn't be too hard. But I first need the data. And I don't know how to web scrape.
    So if you think you could scrape that site for the data discussed above and maybe a few other nuggets, I would compensate you for your time.
     
  12. Beechervol

    Beechervol Super Moderator

    Don't waste your time. I've already picked the field and winning order.

    Changes are coming. Hide yo kids......
     
  13. fl0at_

    fl0at_ Humorless, asinine, joyless pr*ck

    I'll need to read through this whole post. I skimmed, but wanted to let you know I saw it.

    I use a couple different PHP modules to read all the HTML, and then parse the tags.

    The one I have used for a while is simple html dom (http://simplehtmldom.sourceforge.net). It isn't very robust, but I'm most familiar with it.

    I'll see what I can do. Getting the data off the site shouldn't be tough. Getting it into excel might be tricky, but there is probably a module out there I can plug in to.
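
    On the Excel part: one low-tech route is to write the scraped rows out as CSV, which Excel opens directly. Here is a sketch using Python's standard csv module rather than the PHP setup mentioned above. rows_to_csv and the column names in the usage note are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows, header):
    """Return CSV text with a header row followed by the data rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

    Something like rows_to_csv(match_rows, ["opponent", "dr"]) gives text that can be saved with a .csv extension. When writing straight to a file instead, open it with newline="" so the csv module controls the line endings.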

    It might be a few days before I can play with it, but I'll look over it today if I can get some down time.
     
  14. kidbourbon

    kidbourbon Well-Known Member

    I don't know much about web-scraping but what I noticed about the tennis abstract site is that the page source doesn't have the stats in it. I assume that means that the numbers are being calculated on the server side, and are showing up on the page as the return of function calls. (not sure if I explained that very well....my coding days are well in rearview)
     
  15. fl0at_

    fl0at_ Humorless, asinine, joyless pr*ck

    This is actually the file you want:
    http://www.minorleaguesplits.com/tennisabstract/cgi-bin/jsmatches/RafaelNadal.js

    But then you just have to figure out what all the noise means.
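
    If that file turns out to be a JavaScript assignment of a plain array literal (an assumption; the thread doesn't show its contents), the array can be pulled out without running any JavaScript. Here is a sketch in Python. extract_array is a made-up name, the variable name you pass in has to be read off the real file, and this only works if the literal happens to be valid JSON (double-quoted strings, no trailing commas).

```python
import json
import re


def extract_array(js_text, var_name):
    """Parse the array literal assigned to var_name in a JavaScript source string.

    Looks for `var <var_name> = [ ... ];` and hands the bracketed part
    to the JSON parser. Raises ValueError if no such assignment is found.
    """
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\])\s*;"
    found = re.search(pattern, js_text, re.DOTALL)
    if found is None:
        raise ValueError(f"no array assigned to {var_name!r}")
    return json.loads(found.group(1))
```

    The text itself can be fetched with urllib.request.urlopen(url).read().decode(). Each inner list would then be one match, with the meaning of each position still to be worked out against what the site displays.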


    Alternatively, might be able to just pull the page, and insert a javascript directive to write the tags, rather than just display them.

    I think it is somewhere around the "make" function, but rather than returning the call, you could, in theory (I'd need to look more into it) also have it execute document.write which would then just write the damn HTML tag rather than just display it.
     
  16. kidbourbon

    kidbourbon Well-Known Member

    Do you know Python? I've heard it's great for these sorts of things.
     
  17. kidbourbon

    kidbourbon Well-Known Member

    And figuring out the link I needed to be looking at from the page source and pulling it from the top seems obvious in hindsight, but I didn't know to do that, and probably wouldn't have figured it out. So, biggie ups.
     
  18. fl0at_

    fl0at_ Humorless, asinine, joyless pr*ck

    I'm not "up" on python. I have the basics down, but it isn't something I know well enough to pull something like this off with any speed.

    I played with his code today, and it is a bugger.

    I can get all the stats and load them into an array of strings. But this is the part that is pissing me off... once the page finishes loading, the variable I created seems to become non-existent. And no matter when I try to write the data out so I can read it, I can never get it to write.

    So I can dump all the stats I want into the console, but I can never manage to physically put them somewhere where I can extract that info.

    It is very well written.

    I think, if I continue to play with this, I'm going to be forced to move to Ghost and python and see if I can dump the data that way. But I'm not optimistic it is going to work, which means I'm hesitant to devote the time.

    It would pretty much be easier to extract that data file's proper name for each of the top 50 players, do the math myself to obtain the stats you want, and then tabulate the data. Because right now, there is just no easy way to scrape this site.
     
  19. kidbourbon

    kidbourbon Well-Known Member

    (1)
    I'm not sure I'm following. You can dump the data and then it goes vamoose?


    (2)
    I'm not surprised. That dude is smart. He's not famous, and so I'm pretty sure he has a day job, and to put up a site like that in your spare time, and to add new features as frequently as he's been adding them -- I think by March the site may actually be able to perform fellatio -- one would have to be pretty bright.

    (3)
    If you think it's going to take a lot of time, and you're not positive it's gonna work, then don't sweat it. His site displays more and better tennis data than any other site out there -- even sites you'd have to pay a subscription to access -- and by a comfortable margin. But he's getting the raw data from somewhere else. And I'll have to assume that the "somewhere else" is easier to scrape.

    (4)
    I'm not sure what you mean by this. He also gets his match data from somewhere, but I don't know where that somewhere is.
     
    Last edited: Jan 31, 2014
  20. fl0at_

    fl0at_ Humorless, asinine, joyless pr*ck

    .
     
