This week's challenge was brought to us by Simon Evans. He used to work as an analyst at a professional sports team, so we looked at a data set similar to one a football analyst would have to work through. There were a lot of fields, and much of the information wasn't relevant to the questions we wanted to answer, so this week was good practice in dealing with these larger data sets.
Step 1 - Input All Files
The first step is to input all of the files for the challenge. We are given data from 5 different seasons, and all of these CSV files have the same structure, meaning we can use the wildcard union input so that all of the tables are stacked on top of each other.
Notice how we have left the matching pattern section blank as we want to include all of the files.
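If you're following along in Python rather than Tableau Prep, the wildcard union can be reproduced by globbing the input folder and stacking the files. This is just a sketch; the folder and file names below are placeholders:
import glob
import pandas as pd

# Pick up every season's CSV in the input folder (path is hypothetical)
files = glob.glob("inputs/*.csv")

# Stack the tables on top of each other, just like the wildcard union
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)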
Step 2 - Filter & Calculate Open Play Goals
Now that we have all of the files, we can start to filter and clean our data. First we want to remove some rows that we aren't interested in: we exclude 'Goalkeeper' from the Position field and exclude 0 from the Appearances field.
Next we want to replace any null values in the Penalties Scored and Freekicks Scored fields with a 0. This can be done by double-clicking on the 'null' value and typing 0 as a replacement.
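For anyone replicating the flow in pandas, the same filtering and null replacement might look like this (the exact column names and casing may differ in your files):
# Exclude goalkeepers and players with no appearances
df = df[(df["Position"] != "Goalkeeper") & (df["Appearances"] != 0)]

# Replace nulls in the set piece fields with 0
set_pieces = ["Penalties Scored", "Freekicks Scored"]
df[set_pieces] = df[set_pieces].fillna(0)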
Finally we can calculate the Open Play Goals with the following calculation:
Open Play Goals
[Goals] - [Penalties scored] - [Freekicks scored]
Then we rename the Goals field to Total Goals Scored.
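In pandas the calculation and rename are just as quick (again, assuming the column names above):
# Open Play Goals = total goals minus penalties and free kicks
df["Open Play Goals"] = df["Goals"] - df["Penalties Scored"] - df["Freekicks Scored"]

# Rename the original Goals field
df = df.rename(columns={"Goals": "Total Goals Scored"})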
As our table has so many fields, it can be hard to navigate and find the fields that you are looking for. I'd recommend using the 'Search' feature as this is a great way to limit the fields that you are viewing and helps to minimise the amount of scrolling that you need to do!
Step 3 - Aggregate Key Metrics
Now we have our calculation ready, we want to limit our fields and focus on the information that will help us answer the required questions.
As we are looking at 5 seasons' worth of data, we want to aggregate these so that we have a total for each player over this period. Our aggregation tool looks like this:
Note: this time I'm using the metadata view, so again it's easier to see all of the fields without having to scroll.
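If you're building this in pandas, the aggregation is a groupby and sum; the grouping fields below (a player name column plus Position) are assumptions based on what the two outputs need:
# Sum the key metrics for each player across all 5 seasons
agg = (
    df.groupby(["Name", "Position"], as_index=False)
      [["Appearances", "Total Goals Scored", "Open Play Goals"]]
      .sum()
)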
After the aggregate our data should now look like this:
Step 4 - Output 1
In the first output we want to look for the top performing players in terms of goals scored and how many goals they have scored per appearance made.
To calculate the goals per appearance we use:
Open Play Goals per Appearance
[Open Play Goals]/[Appearances]
We then want to rank our players on how many Open Play Goals they have scored, so we use a rank calculation:
Finally we only want to keep the top 20 players, so we can use a filter on the rank field.
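Putting output 1 together in pandas might look like the sketch below; Tableau Prep's standard rank roughly corresponds to method="min" here:
# Goals scored per appearance
agg["Open Play Goals per Appearance"] = agg["Open Play Goals"] / agg["Appearances"]

# Rank players by open play goals, highest first
agg["Rank"] = agg["Open Play Goals"].rank(method="min", ascending=False)

# Keep the top 20 ranked players
output_1 = agg[agg["Rank"] <= 20].sort_values("Rank")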
Our table for the first output looks like this, and as you can see, Dele Alli is the only non-forward to make the list:
Step 5 - Output 2
For the second output we want to focus on the top players in each position. We start with another rank calculation, but this time we include Position in the group by:
Then we filter for the top 20 again on the rank field. This time our table should look like this:
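The grouped rank is the only change for output 2; in pandas that means ranking within each Position group before filtering:
# Rank within each position, highest open play goals first
agg["Position Rank"] = (
    agg.groupby("Position")["Open Play Goals"].rank(method="min", ascending=False)
)

# Keep the top 20 in each position
output_2 = agg[agg["Position Rank"] <= 20].sort_values(["Position", "Position Rank"])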
You can also post your solution on the Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help!