DSCI Student Seminar

12:00 - 12:50 PM, Wednesday, April 19, Gildemeister 155

Refreshments will be served beforehand in the Math/Stat Student Lounge (Gild 135).


Application of Clustering Algorithms to Finance Data 

Shane Will 

Finding patterns in financial asset data is both an interesting academic exercise and a potentially profitable one: an individual who can find and exploit such patterns stands to profit considerably. One way to extract patterns from stock data is cluster analysis. During the preliminary analysis, it was assumed that the clusters would break down by industry or sector. On investigation, however, the clusters were not structured by traditional business sectors. In addition to investigating the existing structures, simulations were run to better understand what the clustering algorithm does. The results were unexpected and interesting: all companies in a cluster shared an underlying connection, such as a commodity or a similar industry service. This suggests that simply expecting industries or sectors to move together may not be as effective as tracking the underlying connections between companies. By tracking these underlying connections, one may be able to better predict and react to changes in the market.
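The abstract does not name the clustering algorithm or the simulation design, so the following is only a minimal sketch of the general idea, under stated assumptions: average-linkage agglomerative clustering on a correlation distance, applied to simulated daily returns. The tickers, the two hidden drivers ("oil" and "rates"), and the loadings are all hypothetical and chosen for illustration.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def simulate_returns(n_days=250, seed=0):
    """Simulate returns for six hypothetical companies.

    Assumption for illustration: each company's return is a shared hidden
    driver plus its own noise. Companies from different business sectors
    can share the same driver.
    """
    rng = random.Random(seed)
    oil = [rng.gauss(0, 1) for _ in range(n_days)]
    rates = [rng.gauss(0, 1) for _ in range(n_days)]
    drivers = {
        "AIRLINE_A": oil, "REFINER_B": oil, "TRUCKER_C": oil,
        "BANK_D": rates, "INSURER_E": rates, "HOMEBUILDER_F": rates,
    }
    return {
        name: [0.8 * f + 0.6 * rng.gauss(0, 1) for f in factor]
        for name, factor in drivers.items()
    }

def cluster(returns, n_clusters):
    """Average-linkage agglomerative clustering with distance 1 - correlation."""
    clusters = [[name] for name in returns]

    def dist(a, b):
        return 1.0 - pearson(returns[a], returns[b])

    while len(clusters) > n_clusters:
        best = None  # (average distance, index i, index j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist(a, b) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return [sorted(c) for c in clusters]

if __name__ == "__main__":
    for group in cluster(simulate_returns(), n_clusters=2):
        print(group)
```

In this simulated setting the groups follow the shared driver rather than the business sector, which is the kind of structure the talk describes. Real stock data would need more care, for example in choosing the number of clusters.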

Measuring and Modeling Batted Ball Quality 

Nick Schroeder 

Baseball is a competitive game of offense, defense, athleticism, strategy, and numbers. The recent advent of Statcast technology has yielded data on exit velocity, hit angle, the coordinates of the ball as it passes through the strike zone, the spin rate of the baseball, the break angle, and much more. These data are available at baseballsavant.mlb.com. In this project, we investigate the following question: What pitching characteristics explain the quality of a hit ball? We used principal component analysis to characterize what a well-hit ball is and which pitching variables affect the qualities of well-hit balls. We then fit models to explain exit velocity using five techniques: full multiple linear regression, forward variable selection, backward variable selection, ridge regression, and the lasso. Using data for Mike Trout, each model had low predictive performance, as measured by R-squared values calculated on the test data from a 2/3, 1/3 training/test split. Once the models were fit, we looked at the lasso coefficients. The two largest coefficients corresponded to pitch locations in the upper-inside and upper-middle parts of the strike zone. A possible implication is that a pitcher may not want to pitch to Trout on the inside. The next step in the project is to run more players through the modeling and coefficient analysis. An interactive Tableau visualization is in progress; it will let people see how different pitch types and locations affect exit velocity for different players.
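To make the lasso step and the 2/3, 1/3 evaluation concrete, here is a minimal sketch, not the project's actual code. It assumes a from-scratch coordinate-descent lasso and simulated pitches; the feature names (`up_in`, `up_mid`, `spin`), the effect sizes, and the penalty `lam` are hypothetical, and no real Statcast data is used.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]  # centered
    resid = [yi - y_mean for yi in y]  # residual with all coefficients at 0
    beta = [0.0] * p
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant column carries no information
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / z
            resid = [r - v * (new - beta[j]) for v, r in zip(col, resid)]
            beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta

def predict(intercept, beta, X):
    return [intercept + sum(b * v for b, v in zip(beta, row)) for row in X]

def r_squared(y_true, y_pred):
    """Out-of-sample R-squared: 1 - SSE / SST, using the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - sse / sst

def simulate_pitches(n=600, seed=1):
    """Hypothetical pitches: two location indicators and a spin feature."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        zone = rng.choice(["up_in", "up_mid", "other"])
        up_in = 1.0 if zone == "up_in" else 0.0
        up_mid = 1.0 if zone == "up_mid" else 0.0
        spin = rng.gauss(0, 1)  # standardized; no true effect in this sketch
        X.append([up_in, up_mid, spin])
        # Noisy exit velocity (mph), with assumed location effects.
        y.append(88.0 + 5.0 * up_in + 4.0 * up_mid + rng.gauss(0, 10))
    return X, y

if __name__ == "__main__":
    X, y = simulate_pitches()
    split = 2 * len(X) // 3  # first 2/3 for training, last 1/3 for testing
    intercept, beta = fit_lasso(X[:split], y[:split], lam=0.3)
    test_r2 = r_squared(y[split:], predict(intercept, beta, X[split:]))
    print(dict(zip(["up_in", "up_mid", "spin"], beta)), test_r2)
```

Because R-squared is computed on held-out pitches, it can be low or even negative when noise dominates, which is consistent with reading it as predictive rather than in-sample performance.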