How to... Connect to Data in Files

One of the first steps you will take in any Data Preparation, Visualisation or analytics project is to Input Data. These inputs can come in many different forms but the most common still remains Files.

Files, upon files, upon files


For many, our first experiences of working with data probably involve one piece of ubiquitous software, Microsoft Excel. The spreadsheet has become the jack-of-all-trades for data storage, manipulation and basic reporting. Every organisation across the world has tens, if not hundreds of spreadsheets saved on every computer. All of these will contain useful information, raw data or reference tables that will help answer the questions experts are posing. 

The issues spreadsheets create

There are a number of potential issues this creates that whilst aren't unique to spreadsheets, they are considerations to make:
  • Accuracy - how has the data been changed since being added to the spreadsheet? 
  • Timeliness - how old is the data? 
  • Optimisation - any manipulation or analysis is often manual and non-repeatable
  • Size - with limits on rows and columns, sometimes the data grows beyond spreadsheets
  • Sharing - multiple people using multiple sheets can create confusion and errors
Organisations are complex. Merges, acquisitions and splits in organisations have created a complex history and network to unpick and the data often reflects this. Trying to ensure the data is what you think it is can be hard to prove and then maintain that clarity. 

It's not just Excel

Comma Separated Variables (CSV), Text, Spatial, Statistical files are all forms where data is likely to be living inside your computer that could help you answer the questions you have. All these file types are created and used for multiple reasons. Often they are default outputs from different programmes where you might be working with operational or analytical systems in your organisation. So what are these files and where do they come from:

CSV: can be output as a data storage file. Whether you are sharing data for programmes that are not in the Microsoft Suite or just simply moving data in to databases, CSVs are very flexible file types. Columns of data are separated by commas (hence the name) with new rows of data held on separate lines within the file. 

Text: are even more simplistic and therefore a smaller file to store. Again suited for holding data in a very simple format, text files have the potential to be more problematic to work with due to their ability to contain very different formats. 

Spatial: often created by Geographic Information Systems (GIS) tools, Spatial files can contain a lot of information to help spatial analysis. Spatial data files hold simple geographic objects like points, lines and polygons. These spatial objects can show a lot about data point relationships and patterns.  

Statistical: R and Python are becoming common tools to be taught in universities. Therefore, they are being demanded more in the workplace, especially by 'data scientists' and those working with statistics. Just because a model is created by a data scientist, doesn't mean the results are not required by experts in your organisation to complete further analysis. 

PDF: These files are becoming more common place and often contain useful datasets inside them. Depending on how the PDF has been formed and formatted makes a massive difference as to how easy these files are to be used and prepared.

Flexibility is not your friend

These file types can help hold a range of data and in lots of different formats. It's one of the many reasons they are used but creates a significant set of issues for us as Data Preppers. Each file will have its own structure, messiness and set of challenges that need to be overcome before the data can be used for analysis or augmented with other data sets to add extra value. 

Where to find your data files

They could be anywhere and everywhere. Knowing the right people in your organisation who may have collaborated with others or formed the files themselves is key. Here are a few considerations to make about where the files are held once you track them down:
On your computer formed by you
Building / forming your own data sets is a common task in the workplace. This might be from capturing numbers out of reports you read, records captured by you or aggregations of other data sets sent to you. 

On your computer formed by others
Emailed database extracts, survey results or market research are also common ways to receive data. The greatest challenge with this data is get the frequent updates you require to keep your analysis fresh. Getting information and insight in to data item definitions and 'freshness' of the data is easier as you are likely to know who to ask... now you just need to wait for them to reply. 

Cloud Storage
Team drives held on Cloud computers and servers create a new challenge as others will be adding files to shared drives. This makes tracking down the sources for the information a lot tougher. For example, such tasks as getting definitions of what data points mean, understanding what aggregations have been applied or knowing what data has been removed may be more difficult. Even the task of refreshing the dataset may be more challenging as the origins of the file(s) are difficult to trace.

So why bother?

The flexibility means so many useful bits of data will exist in these files. Whether it is projected budgets or targets through to the latest reorganisation structure to be applied to your earnings, you will need to be comfortable preparing these data sets for analysis. Now rather than using manual manipulation, you will be able to use Tableau Prep to be able to pick apart the useful from the dross and structure the data to be ready for your analysis. 

How to connect to these files in Prep

When opening Tableau Prep Builder you will find the 'Home' screen very similar to Tableau Desktop. To make any data connection, you will need to click on the '+' next to the word Connections.

This will expand the choice of connections you can make in Prep Builder to show all the connection types including the File Types listed above (except Spatial files as of version 2019.4).

By clicking on any of the file types, you will open a 'File Explorer' (Windows) or 'Finder' (Mac) to show the location of your file. Once you find the file, click open to make the connection to that file. Now it's time to prepare your data for your purposes.

Popular posts from this blog

2023: Week 1 The Data Source Bank

2023: Week 2 - International Bank Account Numbers

How to...Handle Free Text