This week's challenge was brought to us by Simon Evans. He used to work as an analyst at a professional sports team, so we looked at a data set similar to one a football analyst would have to work their way through. There were a lot of fields, and much of the information wasn't important to the questions we wanted to answer, so this week was good practice in dealing with these larger data sets.
Step 1 - Input All Files
The first step is to input all of the files for the challenge. We are given data from 5 different seasons, and all of these CSV files have the same structure, meaning we can use the wildcard union input to stack all of the tables on top of each other.
Notice how we have left the matching pattern section blank as we want to include all of the files.
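If you're following along in Python rather than Tableau Prep, the wildcard union can be sketched with `glob` and pandas. The file pattern and column names here are assumptions for illustration only:

```python
import glob
import pandas as pd

def wildcard_union(pattern: str) -> pd.DataFrame:
    """Read every CSV matching the pattern and stack the tables
    on top of each other (all files are assumed to share a schema)."""
    frames = [pd.read_csv(path) for path in sorted(glob.glob(pattern))]
    return pd.concat(frames, ignore_index=True)

# e.g. wildcard_union("inputs/*.csv") to union all five season files
```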
Step 2 - Filter & Calculate Open Play Goals
Now that we have all of the files, we can start to filter and clean our data. First we want to filter out the rows we aren't interested in: this means excluding 'Goalkeeper' from the Position field and excluding 0 from the Appearances field.
Next we want to replace any null values in the Penalties Scored and Freekicks Scored fields with a 0. This can be done by double clicking on the 'null' value and typing 0 as a replacement.
Finally we can calculate the Open Play Goals with the following calculation:
Open Play Goals
[Goals] - [Penalties Scored] - [Freekicks Scored]
Then we rename the Goals field to Total Goals Scored.
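The filtering, null replacement, and calculation above can be sketched in pandas as one function. The field names ("Position", "Appearances", "Penalties Scored", "Freekicks Scored", "Goals") are taken from the steps above; if your source uses different capitalisation, adjust accordingly:

```python
import pandas as pd

def clean_and_calculate(df: pd.DataFrame) -> pd.DataFrame:
    # Exclude goalkeepers and anyone with zero appearances
    df = df[(df["Position"] != "Goalkeeper") & (df["Appearances"] != 0)].copy()
    # Replace nulls in the set-piece fields with 0
    set_pieces = ["Penalties Scored", "Freekicks Scored"]
    df[set_pieces] = df[set_pieces].fillna(0)
    # Open Play Goals = Goals - Penalties Scored - Freekicks Scored
    df["Open Play Goals"] = df["Goals"] - df["Penalties Scored"] - df["Freekicks Scored"]
    # Rename Goals to Total Goals Scored
    return df.rename(columns={"Goals": "Total Goals Scored"})
```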
As our table has so many fields, it can be hard to navigate and find the fields that you are looking for. I'd recommend using the 'Search' feature, as this is a great way to limit the fields that you are viewing and helps to minimise the amount of scrolling that you need to do!
Step 3 - Aggregate Key Metrics
Now we have our calculation ready, we want to limit our fields and focus on the information that will help us answer the required questions.
As we are looking at 5 seasons' worth of data, we want to aggregate these so that we have a total for each player over this period. Our aggregation tool looks like this:
Note, this time I'm using the metadata view so again it's easier to see all of the fields without having to scroll.
After the aggregate our data should now look like this:
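In pandas, this aggregation is a groupby-sum. The grouping keys and summed metrics below are an assumption about which fields were kept; note that Position is included in the grouping so it survives for Output 2:

```python
import pandas as pd

def aggregate_players(df: pd.DataFrame) -> pd.DataFrame:
    """One row per player, summed across all five seasons."""
    metrics = ["Appearances", "Total Goals Scored", "Open Play Goals"]
    # Group on Name and Position so Position is still available later
    return df.groupby(["Name", "Position"], as_index=False)[metrics].sum()
```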
Step 4 - Output 1
In the first output we want to look for the top performing players in terms of goals scored and how many goals they have scored per appearance made.
To calculate the goals per appearance we use:
Open Play Goals per Appearance
[Open Play Goals]/[Appearances]
We then want to rank our players on how many Open Play Goals they have scored so we use a rank calculation:
Then finally we only want to keep the top 20 players, so we can use a filter on the Rank field.
Our table for the first output looks like this, and as you can see, Dele Alli is the only non-forward to make the list:
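Output 1 can be sketched in pandas as follows: the per-appearance calculation, a rank on Open Play Goals, then a filter to the top 20. The tie handling (`method="min"`) is an assumption about how Prep's rank behaves:

```python
import pandas as pd

def output_one(df: pd.DataFrame, top_n: int = 20) -> pd.DataFrame:
    out = df.copy()
    # Open Play Goals per Appearance = [Open Play Goals] / [Appearances]
    out["Open Play Goals per Appearance"] = out["Open Play Goals"] / out["Appearances"]
    # Rank players by Open Play Goals, highest first
    out["Rank"] = out["Open Play Goals"].rank(method="min", ascending=False)
    # Keep only the top N players
    return out[out["Rank"] <= top_n].sort_values("Rank")
```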
Step 5 - Output 2
For the second output we want to focus on the top players in each position. We start with another rank calculation, but this time we include Position in the group by:
Then we filter for the top 20 again on the Rank field. This time our table should look like this:
You can also post your solution on the Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help!