ibmi-brunch-learn

Announcement

Collapse
No announcement yet.

Storing data vertically

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Storing data vertically

    This is sort of a theoretical question. It's unfamiliar territory to me.

    In the relational model, data is stored in tables, one row per entity. Each table has attributes (aka fields or columns). One attribute, or one group of attributes, uniquely identifies each row. I'm not telling you anything you don't know. For purposes of this discussion, I will refer to this as storing data horizontally.

    Another way to store data is what I will call storing data vertically. Each row is one attribute of an entity. The columns are:
    • key value
    • attribute name
    • attribute value

    So, here's a customer stored horizontally:

    Account number Name Street City State ZIP code
    12345 Bertha D. Blooze 101 Rampart St New Orleans LA 70114

    Here's the same data stored vertically.

    Key Attribute name Attribute value
    12345 Name Bertha D. Blooze
    12345 Street 101 Rampart St.
    12345 City New Orleans
    12345 State LA
    12345 Zip code 70114

    The vertical method may be relational technically (the data is in a table and each row is uniquely identified by key and attribute name), but the fact that all attributes of an entity are not in the same row is enough reason for me not to consider it relational.

    My question is this: Have you ever worked with vertically-stored data, and if so, what was your experience with it?

    Ted



  • #2
    I've not worked with this structure, but there are several reasons I would avoid it!
    • You lose data types. Not only does this mean that you can't store numeric values or dates in their native form, but it also means you can't apply foreign key constraints
    • Similarly, it's more difficult to enforce the presence of data. Rather than declaring a column as NOT NULL, your only option now is a trigger to validate that the attribute has been set. This will have worse performance and increased code complexity.
    • You will more quickly hit the maximum number of rows permitted by the database. This is about 4 billion currently - obviously not all tables will get there, but some might!
    • Indexing becomes more tricky. You could create sparse indexes, that only contain entries for certain attributes, but those are going to hold more data than an index only containing the "attribute value" data
    • Lastly, queries are awful to write. Instead of writing:
    Code:
    SELECT name, state, zip_code
    FROM customer
    WHERE customer_id = 12345;
    You now have to do several joins, which are horrible to read and will likely run more slowly:

    Code:
    SELECT name_tbl.attribute_value AS name,
           state_tbl.attribute_value AS state,
           zip_tbl.attribute_value AS zip_code
    FROM customer_kav AS name_tbl
      JOIN customer_kav AS state_tbl
        ON state_tbl.key = name_tbl.key
      JOIN customer_kav AS zip_tbl
        ON zip_tbl.key = name_tbl.key
    WHERE name_tbl.attribute = 'Name'
          AND state_tbl.attribute = 'State'
          AND zip_tbl.attribute = 'Zip code'
          AND name_tbl.key = 12345;
    It does have its uses though. Say you were a vendor, and you wanted to let your clients decide exactly what customer data to capture with your system. They could request this new attribute to be added traditionally, of course, but they may also appreciate the ability to do this without commissioning a software change and organising an upgrade.

    It could also be useful where you somehow have thousands of possible attributes, but a typical entity will only ever have a few populated. You may store less data with a key-attribute-value approach, and your queries may perform better. I struggle to think of any scenario like that off-hand, though. I would probably be more tempted to store the data as JSON, and put that into the table instead (e.g. in a CLOB).

    Comment


    • #3
      Ah, generic key/value tables

      Our application is fairly old and most of it is written using native IO (F-specs) and, for some reason, a separate servcice program for each older table with dedicated procedures to CHAIN/SETxx/READx for each key, using the table itself as the parameter DS. We don't write those any more, but plenty exist.

      And that means there is a significant reluctance to amend tables to add columns, due to the number of affected programs that would need to be modified or recompiled. So there has been a tendency to use code/value tables where adding columns to existing tables would have been the proper solution.

      Comment


      • #4
        Thanks for the replies. I would have been surprised if many people had responded that they had experience with this sort of thing. I have no experience with it either.

        MartinTosney , you thought of some disadvantages I had not thought of.

        Comment


        • #5
          TH - It's hard to see the utility of this in a system of record. It might be useful in an analytical situation. A variation of the idea would be each attribute being a separate table, although I'm not sure of the utility of that either except maybe in analysis.

          MT - Just for grins, three rpg chains could accomplish the same thing as that sql. We're fortunate to not be constrained to only sql the way the poor souls stuck on other platforms are.

          Comment


          • #6
            Originally posted by UserName10 View Post
            MT - Just for grins, three rpg chains could accomplish the same thing as that sql. We're fortunate to not be constrained to only sql the way the poor souls stuck on other platforms are.
            I certainly don't disagree it would read better

            Even though this seems like a throw-away comment, I'm going to address it seriously. Let's make sure we're comparing apples to apples. Would you rather code:

            Code:
            chain (customerId : 'Name') cust_kav kav;
            if %found(cust_kav);
              name = kav.attribute_value;
            
              chain (customerId : 'State') cust_kav kav;
              if %found(cust_kav);
                state = kav.attribute_value;
              endif;
              chain (customerId : 'Zip Code') cust_kav kav;
              if %found(cust_kav);
                zipcode = kav.attribute_value;
              endif;
            
              doSomething(name : state : zipcode);
            endif;
            Or:

            Code:
            chain (customerId) customer cust;
            if %found(customer);
              doSomething(cust.name : cust.state : cust.zipcode);
            endif;

            Comment


            • #7
              MartinTosney, there's no doubt in my mind that the single chain is superior to the three chains. Your example is good, but there's more.
              1. The three-chain approach does not account for one or more chains failing. What if the first and third chains succeed and the second one fails? doSomething uses whatever value happens to be in State. Therefore, the example needs to clear name, state and zipcode before the first chain.
              2. doSomething is contingent on the first chain succeeding, but what if it fails, yet there is data for the other two fields? doSomething should be conditioned to at least one of the chains succeeding.
              As for SQL, this is difficult enough with native I/O. I wouldn't think of trying to use SQL, which has led me to an observation I had never thought of. Maybe a new rule of thumb should be, "If you can't easily query the table with SQL, the data is probably not relational." Or maybe I should say it's probably not normalized.

              Comment


              • #8
                Instead of 3 chains or coding multiple JOINs, I'd generate a view which converts the data in a relational form, so it can easily be joined.
                Code:
                 Create View yourSchema.YourView as
                Select Key, Min(Case When AttrName = 'Name' Then AttrValue End) as Name,
                      Min(Case When AttrName = 'Street' Then AttrValue End) as Street,   
                      Min(Case When AttrName = 'City' Then AttrValue End) as City,
                      Min(Case When AttrName = 'State' Then AttrValue End) as State,
                      Min(Case When AttrName = 'Zipcode' Then AttrValue End) as ZipCode
                   from x
                   Group By Key;

                Comment

                Working...
                X