Bokeh is another great visualization library that has a few advantages over Matplotlib and Seaborn in certain situations. Bokeh generates interactive, HTML-ready visualizations that can be embedded directly into a website. Here, I'll show you step by step how to take a dataset and generate these interactive visualizations. For this tutorial, I looked at baseball data from the 2018 season to see which of the six divisions MLB teams are separated into hit the most home runs. I also looked at how each of the 30 MLB teams compared when it comes to home runs hit in one season. The data was taken from Sean Lahman's baseball database. For more detail, see the full repo on my GitHub.
Loading the data and importing libraries
Selecting the right data
The Pie Chart
Adding the visualization to your blog or website
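Before getting to the embedding step, here is a minimal sketch of how the pie chart itself can be built with Bokeh's wedge glyph. The AL East's 1030 home runs and the NL Central trailing by just over 100 come from the post; the other divisional totals are placeholders, not figures pulled from the Lahman database.

```python
from math import pi

import pandas as pd
from bokeh.palettes import Category20c
from bokeh.plotting import figure, output_file, save
from bokeh.transform import cumsum

# 2018 home-run totals per division (AL East's 1030 is from the post;
# the rest are illustrative placeholders)
totals = {
    "AL East": 1030, "AL Central": 851, "AL West": 968,
    "NL East": 946, "NL Central": 929, "NL West": 922,
}

df = pd.DataFrame(list(totals.items()), columns=["division", "hr"])
df["angle"] = df["hr"] / df["hr"].sum() * 2 * pi  # slice size in radians
df["color"] = Category20c[len(df)]

output_file("pie.html")  # the file referenced later in the post

p = figure(height=350, title="2018 Home Runs by Division",
           toolbar_location=None, tools="hover",
           tooltips="@division: @hr HR", x_range=(-0.5, 1.0))
p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum("angle", include_zero=True),
        end_angle=cumsum("angle"),
        line_color="white", fill_color="color",
        legend_field="division", source=df)
p.axis.visible = False
p.grid.grid_line_color = None

save(p)  # writes pie.html to the working directory
```

Hovering over a slice shows the division name and its home-run total, which is exactly the interactivity the embedded chart keeps on the website.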
Now, this is the real reason, for me at least, to use Bokeh. Since you called the output_file function above, you can go to your local directory to find that file, in this case "pie.html". Open the file, view it in your browser, open the source code (in Chrome you click View -> Developer -> View Source) and copy the HTML code. Since I am using Weebly for this blog, I simply drag an Embed Code box into my blog post and paste my HTML code into the box.
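As an alternative to copying the page source by hand, Bokeh's embed module can generate the script and div snippets directly. A quick sketch, using a tiny placeholder figure in place of the pie chart:

```python
from bokeh.embed import components
from bokeh.plotting import figure

# Any Bokeh figure works here; a small placeholder plot stands in for the pie chart
p = figure(height=200, title="placeholder")
p.line([1, 2, 3], [4, 5, 6])

# components() returns a <script> block and a matching <div> placeholder
script, div = components(p)

# Paste both into the page (the page also needs Bokeh's CDN JS in its <head>)
print(div)
```

This skips the View Source step entirely, though for a drag-and-drop editor like Weebly the copied HTML file works just as well.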
As you can see by scrolling over the slices of the pie, the AL East led the majors with 1030 dingers in 2018, with over 100 more home runs hit than the NL Central.
The Dot Plot
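A rough sketch of how a dot plot like this can be built in Bokeh. The Yankees' 267 home runs is the figure discussed in this post; the other team totals here are placeholders, and a real version would plot all 30 teams from the Lahman data.

```python
from bokeh.plotting import figure, output_file, save

# Illustrative 2018 team totals (only the Yankees' 267 is from the post)
teams = {
    "New York Yankees": 267, "Los Angeles Dodgers": 235,
    "Oakland Athletics": 227, "Cleveland Indians": 216,
    "Boston Red Sox": 208,
}

# Order the categorical axis so the most home runs appear at the top
ordered = sorted(teams, key=teams.get)

output_file("dot.html")
p = figure(y_range=ordered, x_range=(0, 300), height=300,
           title="2018 Home Runs by Team", tools="hover",
           tooltips="@y: @x HR")
# A stem from zero plus a dot at each team's total
p.segment(x0=0, y0=list(teams), x1=list(teams.values()),
          y1=list(teams), line_width=2)
p.scatter(x=list(teams.values()), y=list(teams), size=10)
save(p)
```

As with the pie chart, the saved HTML file can be opened in a browser and its source embedded into the site.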
A member of the AL East, the New York Yankees led all Major League teams with 267 home runs.
Now, you can see how useful Bokeh visualizations can be for data scientists communicating their findings through blog posts. Rather than having to copy and paste, or screenshot images, you can create clean, interactive visualizations to embed into your site.
How to interpret multi-linear regression coefficients when using dummy variables from categoricals
4/6/2019

Never in my life did I think I would be writing a title with such a heavy load of technical jargon. But, for data scientists performing such an operation, this is a big topic. For a project I did in my first module at the Flatiron School, I was given a data frame including housing data from King County, Washington. For more information about this data, please take a look at the Kaggle competition site. I used this data to create a model that would predict, as accurately as possible, housing prices in the county. One of the tricky aspects of this was dealing with a large quantity of categorical data. In this blog post, I will explain the evaluation aspect of the model, given that so many variables were category types.

1. Converting to Categoricals
Specifically for the zip code variable, I converted it to a categorical and created dummy variables so that each individual zip code could be isolated during the regression analysis.

2. Log Transformations and Min-Max Scaling
Once I converted to categoricals, the numerical variables needed to be normalized. In addition, I needed to put all of my variables on a 0 - 1 scale in order to run Recursive Feature Elimination (RFE) and to improve the accuracy of my model.

3. Recursive Feature Elimination (RFE)
Running the RFE showed that 2 different zip codes made it into the list of the top 5 most important features to include in my model.

4. Evaluating the Zip Code Coefficient
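The first three steps can be sketched on synthetic data roughly as follows. The column names (zipcode, sqft_living, bedrooms, price) and the synthetic values are stand-ins, not the exact King County schema.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Tiny synthetic stand-in for the King County housing data
n = 200
df = pd.DataFrame({
    "zipcode": rng.choice(["98039", "98040", "98103", "98115"], size=n),
    "sqft_living": rng.integers(500, 5000, size=n),
    "bedrooms": rng.integers(1, 6, size=n),
})
df["price"] = 100_000 + 300 * df["sqft_living"] + rng.normal(0, 20_000, size=n)

# 1. Convert zip code to dummy variables, one 0/1 column per zip code
dummies = pd.get_dummies(df["zipcode"], prefix="zip")
X = pd.concat([df[["sqft_living", "bedrooms"]], dummies], axis=1)

# 2. Log-transform the target and min-max scale predictors onto 0 - 1
y = np.log(df["price"])
X_scaled = MinMaxScaler().fit_transform(X)

# 3. Recursive Feature Elimination down to the 5 strongest features
rfe = RFE(LinearRegression(), n_features_to_select=5)
rfe.fit(X_scaled, y)
selected = X.columns[rfe.support_]
print(list(selected))
```

The coefficients of the model fit on these scaled, dummy-encoded predictors are exactly the ones that need careful interpretation below.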
When looking at the coefficients in my model, how do I interpret an individual zip code and its influence on price? Does it mean that moving from zip code 98039 to zip code 98040 will increase price by a certain amount? Or does it mean that houses in zip code 98039 will increase in price by that amount? Remember that my zip code variables were converted into categorical dummy variables and my price variable was log transformed. So, any zip code coefficient essentially means that price is influenced by that much by the house being, or not being, located in that zip code. In order to make sense of this coefficient, we would have to convert our predictors back from the log transformation and normalization.

If you were suddenly unable to drive your car at the speed limit on the highway, would you be concerned? Sure, you could slide on over to the slow lane and cruise to your destination with a little more of an easy-going, "I don't really need to get anywhere fast" kind of vibe. Or, you could anxiously stare at your dashboard, waiting for the check engine light to turn on while continually looking in the rear view to make sure smoke isn't pouring out of your tailpipe. Whether you're talking about your car or your favorite starting pitcher, it is important to know if there is a real problem that requires a trip to the mechanic or not.

A dashboard can now be created for pitchers utilizing data visualization and the average velocity of their pitches over a time period. For starting pitchers in the MLB, PITCHf/x data is analyzed to show average pitch velocity per game. Extremely high definition cameras installed in every baseball stadium in the MLB record information for every pitch thrown. Here is an example of one of the best pitchers in the game, Clayton Kershaw, and his average fastball velocity per game over the 2018 season:

Image provided by Fangraphs.com

This simple graph can tell a coach, a pitcher, an opposing team or hitter a lot.
Generally, we can see that in the 2018 season, Kershaw's fastball was mostly in the 90 - 94 mph range. However, many Los Angeles Dodgers and Kershaw fans are worried about his declining velocity over longer periods of time. If we now look at the same information and visualization, but add in the 3 previous seasons, we can see why fans might be worried.

Image provided by Fangraphs.com

In baseball, a pitcher's fastball velocity is one of the biggest measurements of talent. There have been a few studies done on pitchers and the impact of declining fastball velocity, but for now, let's just say it's usually not a good sign. However, because visualizations like these are being produced for pitchers and baseball organizations, better decisions can be made, such as cutting back on the innings an aging pitcher takes the mound during a season, or changing the types of pitches a pitcher throws. Many aging pitchers have used visualizations like these to change their style of pitching to rely more heavily on lower-speed pitches, effectively extending their careers.
Thanks to PITCHf/x technology, what was once written down on a clipboard after looking at a radar gun and spitting tobacco juice in between pitches is now automated and used to create useful visualizations.

I was first introduced to the practical use of data science in the form of professional baseball. Reading books like Moneyball, by Michael Lewis, Big Data Baseball, by Travis Sawchik, and more recently, Astroball, by Ben Reiter, I started to learn the ways in which data science was giving baseball teams a competitive edge. I was intrigued by the way mathematics and a deep analysis of data could project a baseball player's likelihood of succeeding in the major leagues, and how analysts were using big data to find hidden value in players.

Take, for example, the story of catcher Russell Martin as written in Big Data Baseball. Author Travis Sawchik writes of how data scientists working for the Pittsburgh Pirates analyzed pitch data, using MLB's PITCHf/x technology, to discover Martin's ability to "frame" pitches and how that ability translated to more wins for the team overall. By looking at data on called balls and strikes during Martin's career, they found that the number of pitches that should have been called balls but that Martin was able to "frame" into called strikes was above average. Analysts were then able to use this statistic to calculate how many runs Martin prevented, given the number of pitches he manipulated into called strikes simply by the way in which he caught the ball and subtly moved his glove into the strike zone, otherwise known as pitch framing. This ultimately led to more strikes called for Pirates pitchers and gave the team a competitive advantage. Only through the analysis of large data sets were the Pirates able to see that Martin had this ability, sign him from the free-agent market, and obtain an edge at a discounted cost.
This chart, provided by Stat Corner, shows that in 2018 Russell Martin was still among the top 15 catchers in the major leagues in pitch framing ability, with an oStr% (called strikes that were outside of the strike zone and should have been called balls) of 8.3.

As I became more and more interested in baseball data and the research questions analysts were answering with publicly available baseball data, I was led to Fangraphs.com, where bloggers ask and answer great baseball questions on a daily basis. I started to crave the ability to analyze data on my own and wanted to write my own blogs about baseball data findings, but I had no idea how to use the R and Python programming tools that analyze these large data frames. So, I bought the book Analyzing Baseball Data with R and started to teach myself. Slowly but surely I began experimenting and learning more and more, and I began writing my own baseball articles where I could communicate some of my findings.

The more I learned, the more I started to think beyond baseball. Wondering where else this work was being done, I started to research the job market for data scientists. This graph, taken from KDnuggets, shows the growth in the job market, with data scientist positions increasing 6.5 times compared to 5 years ago. I began to see that there is data to be analyzed everywhere, and that the opportunity to find my niche in the world of data science is very real.

As I continue on this journey of becoming a data scientist, I hope to meld the experiences and skill sets from my current career as a science teacher and environmental educator with a new skill set in technology. I am particularly excited about the prospect of continuing to wear my teacher hat by presenting technical and statistical findings to others and explaining a complex analysis in a way that everyone can understand.
In the short term, I hope to learn everything I can about the tools and programming styles that analyze the large datasets that move big decisions forward in today's world. I hope to earn a position working for an organization to help move those big decisions and make an impact on the world. Finally, I hope to continue my baseball blog and data analysis in hopes that one day I could become a (somewhat) respectable analyst and baseball writer.