Transform SQL Data

 Now that you have created your SQL database and imported some data into it, we need to work through the data tables and standardize the existing information and create table relationships to constrain column values. Below are a set of steps that I would take if this were my data. Please pay attention to detail!!!! 

One of the things that you will need to pay constant attention to is making sure that within SQL Management Studio you always have YOUR database as the selected one because all of us are working on identical data simultaneously. You don't want to adjust someone else's data thinking that it is your data...

Recall that we have many goals when creating a database. One goal is to store data one time and one time only. Secondly, we want to make sure that our data within each column is of a specific type and that the values are what we expect. Let's work through the tblFieldData table one column at a time.

  1. The 'base-piece' of this data are the locations of where each sample was collected. So, let's rearrange some of the columns and remove an unnecessary column.
     
    • Right-click on tblFieldData and select 'Design'.
       
    • Select the 'Plot' column and move/drag it to be the first column. I like having the Primary Key columns be first within a table. Now click the Save button to save the schema.
       
    • Select the 'PlotArray' and move/drag it to be the second column. Click Save
       
    • Because we have Treatment stored in our fcPlots table, we don't need this column in the Field Data because it is related to Plot. Right click on 'Trtmt' and select Delete Column. I promise you are not losing any data... Click Save
       
  2. The 'DirOfMvmt' column is categorical, so we should check to see what values exist within this column. 
    • Run the following query

      SELECT DirOfMvmt
      FROM tblFieldData
      GROUP BY DirOfMvmt
       
    • You will notice in the results that there are 6 different values. In this case we should make the two non-directional values be equal to 'UNK' (for unknown).Write a SELECT query to test your filter is grabbing the rows that you truly want to update.

      SELECT DirOfMvmt
      FROM tblFieldData
      WHERE DirOfMvmt = '-' OR DirOfMvmt IS NULL
       
    • Update the values using an UPDATE query:

      UPDATE tblFieldData
      SET DirOfMvmt = 'UNK'
      WHERE DirOfMvmt = '-' OR DirOfMvmt IS NULL
       
    • Rerun the SELECT query to see if you get the expected results.
       
    • Hopefully you noticed that the DirOfMvmt column allows for 255 characters. That is FAR TOO MANY for this column. Go to table Design and change the nvarchar(255) to nvarchar(3) (the maximum character length).
       
  3. Next we need to test if the SpecCode from tblFieldData all match the SpecCode in luSpecies. To perform this test, we are going to dynamically join the two tables together using the JOIN command. There are several JOIN types. We are using an OUTER JOIN, which will return all of the rows of data from tblFieldData and only the matching rows from luSpecies. We can then look for NULL values within the Expr1 column...if a NULL is returned, it means that the SpecCode in tblFieldData does not exist within luSpecies. We don't want that to happen because we want to constrain all of our Species to be something that we recognize. Hence, if there were to be unmatched values, we would want to change the values in one or both of the tables depending on the scenario.
    • Run the following query to test for unmatched values between the two tables:

      SELECT tblFieldData.Plot, tblFieldData.PlotArray, tblFieldData.SampDate, tblFieldData.SpecCode, luSpecies.SpecCode AS Expr1
      FROM tblFieldData LEFT OUTER JOIN
      luSpecies ON tblFieldData.SpecCode = luSpecies.SpecCode
       
    • If you scroll through your results, you will notice that there aren't NULL values. How would you determine if there were NULL values if we had 1 million rows of data?

      Keep thinking.....




      Well you would use a WHERE clause of course! Add a WHERE clause to your SELECT statement:
       
      • WHERE luSpecies.SpecCode IS NULL

         
    • You may be wondering how to create the JOIN. Well, you can simply type your SQL manually. Not a bad choice, but it can take longer to develop code than one might feel like spending. SQL Management Studio does provide some GUI options to create queries.
       
      • Make sure that your Database is highlighted. Then click the 'New Query' button. You should get a blank *.sql tab.
         
      • Right-click in the blank area and select 'Design Query in Editor'
         
      • Add the luSpecies and tblFieldData tables and then close the dialog box.
         
      • Next we need to add the fields we want returned in our query. Check the boxes within tblFieldData for Plot, PlotArray, SampDate and SpecCode
         
      • Check the box for SpecCode with luSpecies
         
      • Now we need to JOIN the two tables by a common field(s). Click and Drag luSpecies.SpecCode over to tblFieldData.SpecCode. You should see a new graphic displaying the JOIN. 
         
      • Right-click on the 'diamond' of the JOIN. Choose Select all Rows from tblFieldData


         
      • Click OK. The click the Execute button to make use of your SQL Statement.
         
      • You have just created that same query that we previously executed, but using a GUI. Fantastic!! 
         
        • As a word of advice, if you want to view an existing SQL statement within the Query Designer, you need to select/highlight the SQL, then Right-click and choose the Query Designer. It will provide the GUI-view of your SQL statement...
  4. Perform the same actions that you did in Step 2 on the following columns: Sex, Age, Breeding
     
    • For Sex, make '-' and NULL = 'UNK'
       
    • For Age, make '-' and NULL = 'UNK'
       
    • For Breeding, make '-' and NULL = 'UNK'
       
    • Don't forget to change the Design of the each of these columns...!
       
  5. The next issue that we need to solve is that we have several columns of data that are numeric, but because the data were entered so that NULL values were delineated with '-', we had to import the column as a character. Let's deal with the SVL column first
    • Since this is a character column, let's check out the Unique values within the column. Concurrently, we are going to COUNT the number rows for each unique value:

      SELECT SVL, COUNT(*)
      FROM tblFieldData
      GROUP BY SVL

       
    • You will notice that there are 3 values that are going to give us a hard time ('-', '>150', and '0'). This is where is you need to make executive decisions about the data. The following are the decisions I made concerning this data. You may have made different decisions...
      • Change the '>150' value to = '151'. I made this choice because there is only 1 row of data with this value, very unlikely to influence the results. Test your WHERE clause first:

        SELECT SVL
        FROM tblFieldData
        WHERE SVL ='>150'

         
      • If only one row is returned, then UPDATE this row of data

        UPDATE tblFieldData
        SET SVL = '151'
        WHERE SVL = '>150'
         
      • Since you can't have a 0 length, let's set the other values to NULL. First make sure the WHERE clause is getting the correct rows.

        SELECT SVL
        FROM tblFieldData
        WHERE SVL ='0' OR SVL = '-'
         
      • Now UPDATE the values

        UPDATE tblFieldData
        SET SVL = NULL
        WHERE SVL ='0' OR SVL = '-'
         
      • Only numbers now exist in the column, so change the data type from nvarchar to float
      • Perform these same steps for the HF_mm, Tail_mm, Body_mm and Wt_g columns. PAY ATTENTION TO DETAIL. This is important data, you don't want to make a silly mistake!!! Speed is NOT your goal, ACCURACY and CONSISTENCY is...
         
  6. We will just accept the Voucher column 'as-is'.
     
  7. You will notice that data within the columns Temp_H through Moon all have the same data for each row of the same SampDate. This is duplication of data and violates our goal of storing data one time, and one time only (Normalization). We need to remedy this issue. The best way to do that is to create a new table that contains this information and relate it to the other tables/data. First, let's try an assess what makes a unique row of data:
    • We can use the GROUP BY clause to determine Unique rows of data (this is the same as Aggregation, right? ). I have added in an ORDER BY so that we could easily look at the values across Plots for the same SampDate

      SELECT Plot, SampDate, Moon, Precip_prev_24_hours, Precip_current_day, Humidity_L, Humidity_H, Temp_L, Temp_H
      FROM tblFieldData
      GROUP BY Plot, SampDate, Moon, Precip_prev_24_hours, Precip_current_day, Humidity_L, Humidity_H, Temp_L, Temp_H
      ORDER BY SampDate
       
    • Every Column that is included in the GROUP BY is used to create the Unique list of rows. This is a very important behavior of SQL that you must master. It is essential! What we can discern from the query results is that the weather conditions were collected at the Study Area scale, not the plot level scale. Therefore, we do not need to include Plot in our query. Remove plot from the SELECT and the GROUP BY. You should now return 101 rows.
    • To store these data only once, we will create a new table, tblWeather, by using the results from our query. We are using the same SELECT query, but we are adding an INTO command so that the results are pushed to a new static table:

      SELECT SampDate, Moon, Precip_prev_24_hours, Precip_current_day, Humidity_L, Humidity_H, Temp_L, Temp_H
      INTO tblWeather
      FROM tblFieldData
      GROUP BY SampDate, Moon, Precip_prev_24_hours, Precip_current_day, Humidity_L, Humidity_H, Temp_L, Temp_H
      ORDER BY SampDate
       
    • Refresh your Tables and you will see the new table. 
       
    • Now Delete the columns Moon, Precip_prev_24_hours, Precip_current_day, Humidity_L, Humidity_H, Temp_L, Temp_H from tblFieldData because we no longer need them in this table.
       
      • Right-click on tblFieldData and select Design
         
      • Highlight the columns, Right-click and select Delete Columns
         
      • Save your changes and close the tab
         
  8. The final column is BasalArea_m2/ha. These values are at the plot level. Where are we storing Plot information?


    That's right, in the fcPlots table. We need to push this data to fcPlots, matching the values by Plot ID. 
     
    • First we need to create a new column within fcPlots
       
      • Right-click on fcPlots, select Design.
         
      • Add a new column at the end named 'BasalArea_m2_ha'. Type = float
         
      • Save the changes. Close the tab
         
    • Now update the values in fcPlots by using the existing values from tblFieldData. We are using JOIN on the common Plot ID fields. Even though they are not the exact same column name, we can for the JOIN because we know the values are supposed to match

      UPDATE fcPlots
      SET fcPlots.BasalArea_m2_ha = tblFieldData.[BasalArea_m2/ha]
      FROM tblFieldData INNER JOIN
      fcPlots ON tblFieldData.Plot = fcPlots.PLOTID
       
    • Check and see if the results meet your expectations

      SELECT *
      FROM fcPlots
       
    • Now delete the BasalArea field from tblFieldData. Save the changes.
       
  9. The next step is to begin creating our Primary Keys within each table. Recall that a PK is a way to guarantee data integrity by enforcing that the Column or Columns will create a Unique Row of data. This is a SUPER IMPORTANT step and will make your database optimized and keep your data correct as new information is added.
     
    • Right-click on tblFieldData and choose Design. We know that Plot, PlotArray, SampDate and SpecCode combined create a Unique Observation. Highlight the 4 columns, then click the 'Key' button. You should now see the four columns with a key next to them. This creates a Composite Primary Key (a Natural Key).
       
      • Click Save ....I bet you got an error, indicating that we have duplicate rows. This is an issue, sort of. The reason that we cannot enforce this key is because more than one animal of the same species was captured within the same PlotArray at the same date. It would be easy to deal with if they had marked each Individual with a unique tag. That didn't happen, so in this case, we need to simply relax our enforcement of Integrity Rules. 
         
      • Exit out of this tab and do not save changes.
         
      • To create true unique set of columns, we would have to create a SampleID column that is specific to each row of data (i.e. an incremental number associated to the above columns...for example 1, 2, 3  then 1, 2 for the next set of species. This is tedious but probably necessary)
        • Add a Temporary new column called ID (int) (in the Column Properties, set the Identity Specification to = Yes), Click Save
          • Next Highlight the ID column and set it as the PK. This creates a unique row with an AutoNumber and will allow us to edit the table to create the Natural Keys that we desire
             
        • Add a new column to tblFieldData (Insert a column named SampleID (int) )
           
        • Now run a query to determine what you are up against as far as duplicate potential keys

          SELECT Plot, PlotArray, SampDate, SpecCode, COUNT(*)
          FROM tblFieldData
          GROUP BY Plot, PlotArray, SampDate, SpecCode
           
        • You will notice that up to 4 of the same species were caught on the same date and trap
           
        • UPDATE this column with new values of the COUNT. 

          UPDATE [dbo].[tblFieldData]
             SET [SampleID] = Cnt
                FROM            (SELECT        Plot, PlotArray, SampDate, SpecCode, COUNT(*) AS Cnt
                                    FROM            tblFieldData
                                    GROUP BY Plot, PlotArray, SampDate, SpecCode) AS Src INNER JOIN
                                   tblFieldData AS tblFieldData_1 ON Src.Plot = tblFieldData_1.Plot AND Src.PlotArray = tblFieldData_1.PlotArray AND Src.SampDate = tblFieldData_1.SampDate AND 
                                   Src.SpecCode = tblFieldData_1.SpecCode
           
        • This will populate the majority of the SampleIDs with the correct value. However, where the count was >2 you need to edit those rows
          • ​Right-click on tblFieldData and choose Edit Top 200 Rows
          • In the Toolbar, click the SQL button and replace the the SQL statement with this one

            SELECT        Plot, PlotArray, SampDate, SpecCode, SampleID
            FROM            tblFieldData
            ORDER BY Plot, PlotArray, SampDate, SpecCode
             
          • Where ever you see >1 value, make the rows have a unique value, so if the value is 2, in the first row replace the 2 with a 1 so that it is sequential for the unique set of Plot, PlotArray, SampDate, SpecCode......YES, this totally is a lame way of correcting data, but we must learn to walk before we can run.....
             
          • Once this 'eye-blurring, mind-numbing' task is complete. Go back to the Design of tblFieldData:
            • Make sure that the first 5 columns are in this order: Plot, PlotArray, SampDate, SpecCode, SampleID
            • Hightlight them and Make them be the PK. Click save.....if you successfully save with no errors, then select the ID column and delete it. IF you can't save the table, you missed one of the duplicate values. Find your error, correct it, try again.
               
    • Right-click on luSpecies and choose Design.
      • Highlight SpecCode then click the 'Key' button. 
         
      • Hit the Save and close the tab
         
    • Right-click on tblWeather and choose Design
      • Highlight SampDate and make this the PK
      • You get an error. Let's look at the data and see where the mistake is.

        SELECT SampDate, Count(*)
        FROM tblWeather
        GROUP BY SampDate
         
      • There are 3 dates with duplicates (8/1/2008, 8/4/2008, 8/8/2008). We need to fix these because they are typos. Right-click on tblWeather and Choose 'Edit Top 200 Rows'. This is how you can change values manually within your table.
         
        • Scroll down to the 8/1/2008 rows. You will notice that the row with Precip=0.6 is a typo. Delete the row.
           
        • Scroll down to the 8/4/2008 rows. You will notice that the row with Precip=0 is a typo. Delete the row.
           
        • Scroll down to the 8/8/2008 rows. You will notice that the row with CurrentPrecip=0 is a typo. Delete the row.
           
        • Close the Tab, Try and create the PK again. It should work this time. Close this tab
           
    • Right-click on fcPlots and choose Design
      • The PLOTID column is our true PK. Save your changes
         
    • Right-click on fcTrapArrays and choose Design
      • The Plot, PlotArray columns are our true PK. Save your changes
         
  10.  Almost there...Next let's create a lookup table to constrain values within the tblFieldData.DirOfMvmt field. We can do this by simply creating a new data from query results (we have done this already).
    • Query out the unique values:

      SELECT DirOfMvmt
      FROM tblFieldData
      GROUP BY DirOfMvmt
       
    • OK, these results look reasonable. Put them into a table

      SELECT DirOfMvmt
      INTO luDirOfMvmt
      FROM tblFieldData
      GROUP BY DirOfMvmt
       
    • Great, make the DirOfMvmt column the PK.....now add a new column to your new table: Col_Descript (nvarchar(255)). You can now add details about the values that are coded. So, in essence, you are creating metadata for your column. If you feel like you need other columns to describe the units or other attributes of your PK column, then please do that. Lookup tables are INVALUABLE for helping to describe your data without having to store a bunch of extra info over and over and over within your actual 'data table'. Click Save
       
    • Perform the above process for the following columns: Sex, Age and Breeding
       
  11. To finish up the database schema, we need to create the relationships between each of our tables so that we can Enforce Referential Integrity.
     
    • Within YOUR database, Right-click on 'Database Diagrams'. Choose 'New Database Diagram'. There may be an error that gets thrown at you, but hopefully it is inconsequential and you can move forward.
       
    • Using the Dialog, add all of your tables to the diagram. NOTE: this isn't a very smart function and it will place your tables in a somewhat haphazard manner. You will end up moving the tables around the Design to match your idea of the data hierarchy.
       
    • I am choosing the move the fcPlots table to the far left because it is the highest level/scale of our data. It is the Parent.
       
    • Next, move fcTrapArrays to be just to the right of fcPlots....(i.e. TrapArray is a Child of Plot)
       
      • To create a relationship between two tables, the Common fields MUST be of the same Data Type and Length. You will notice that fcPlots.PLOTID is a character, whereas in the other tables the value is 'int'. Change fcPlots.PLOTID to int. Save your changes...
         
      • The common field between these two tables is PlotID. IT IS IMPORTANT that you execute this next step in the correct order. When creating Relationships, you must go From the Child To the Parent. So, Click and Drag fcTrapArray.Plot over to fcPlots.PLOTID. A Dialog will pop-up. Make sure that the Primary key is PLOTID and the Foreign key is Plot (i.e. the direction of the relationship matters). What we are saying with this relationship is that if a Value doesn't exist in the Parent table, it CANNOT exist in the Child table. This helps us make sure we don't have errors within our database and that all of the values are expected, not random.


         
      • Click OK
         
      • In the next Dialog, Expand 'INSERT and UPDATE'. Change both items to be 'Cascade'. Click OK. The reason that we want to 'Cascade' changes is what makes relational databases so awesome. If we make a change in the Parent table, the matching data within the Child tables will also get updated. This function/process helps us guarantee Data Integrity and that we don't introduce silly errors into our data model.


         
      • Now Save the Diagram, naming it your database name_Design (e.g. AlbekeDB_Design). Once you save and you don't receive errors, then the relationship that you just created matches database rules and life is good!
         
    • Create the relationship between fcTrapArrays (Parent) and tblFieldData (Child).
      • The common Columns between the tables is Plot and PlotArray. Before you do anything, make sure that the Data Types match between the tables. If they don't, change the table design(s) to make them match.
      • Click and Drag tblFieldData.Plot over to fcTrapArrays.Plot. In the Foreign section add in PlotArray and click OK


         
      • Remember to change the Cascade.....You will probably get the following error:


         
      • The error occurs because the Child table has a value doesn't exist within the Parent. After a little snooping around using an OUTER JOIN (see above), I was able to find the error. To fix the one row of missing data, run the following query:

        UPDATE tblFieldData
        SET PlotArray='SW'
        WHERE PlotArray='-'

         
      • Now try to create the Relationship, it should save ok now. Make sure to save the Entire Design to make sure the changes were committed.
         
      • Perform the same steps to create the appropriate Relationships the other tables: tblWeather, luSpecies, luSex, luAge and luBreeding. DO NOT FORGET the Cascades...