Adding back trailing zeros in an identification column

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP



Adding back trailing zeros in an identification column



I have the following data which looks like the following;


COLUPC UPC
19638 715643501208 00-01-82000-72608
11783 15230000022 00-01-01820-00198
17140 18769210012 07-01-93271-20600
19624 711925600019 00-01-86243-11167
18706 28382012393 27-01-15503-01673
8716 13410025574 00-01-78250-00236



The columns COLUPC and UPC come from two different data frames and a random sample of each was taken. The COLUPC and UPC are the same unique identifiers. It is just that COLUPC has lost its format.


COLUPC


UPC


COLUPC


UPC


COLUPC



Taking the first row value for COLUPC - 715643501208. It should look like 07-01-56435-01208 following the same structure as in column UPC.


COLUPC


715643501208


07-01-56435-01208


UPC



It gets a little more tricky with observation 15230000022 It should look like 01-05-23000-00022 where there was an extra 0 added in the 4th component 00022 along with component 1 and 2.


15230000022


01-05-23000-00022


0


00022



The documentation that comes along with the data file states that COLUPC has no leading zeros shown.


COLUPC



So;



Component 1 is 2 digits



Component 2 is 2 digits (1 digit with a 0 at the begining)



Component 3 is 5 digits



Component 4 is 5 digits



Taking this number 00-01-01820-00198 from the UPC column. In the COLUPC column this would collapse to 11820198 where all the leading zeros have been removed from component 1, 2, 3 and 4.


00-01-01820-00198


UPC


COLUPC


11820198



If I am a little unclear let me know.



Data:


structure(list(COLUPC = c(715643501208, 15230000022, 18769210012,
711925600019, 28382012393, 13410025574, 716708427455, 12210000157,
13410025602, 17143945712, 17336011341, 13410062505, 11820053218,
11820002989, 13410059505, 11820043550, 18417331130, 711820053025,
18982650001, 18248813000, 17199077603, 18834510005, 13410048602,
18417310545, 11820000987, 13410000217, 17089701331, 13410017256,
618516510302, 23410030602, 17336080301, 613498630003, 17825000257,
16206705155, 17031000013, 13410000555, 16206704972, 37031030703,
27336082301, 17031023882, 18769227102, 18382056793, 31820000697,
13410048574, 17199048004, 11820041550, 11820000884, 18572511509,
14182700012, 13410010505, 18700050898, 18248812910, 18015761147,
18549873837, 16821309117, 618516510402, 17231163009, 611969900006,
11820011349, 11820000769, 13410046805, 13410000162, 13410000579,
18374110012, 21820000801, 17231163011, 23410068505, 17199030030,
11820000771, 17031000009, 28549873736, 11820000784, 17199048016,
13410057256, 11820054028, 11820001991, 13410057602, 17336080341,
21820029031, 715643501210, 17231163012, 17199047004, 18769201103,
13410000623, 15230000540, 618516510704, 17336011348, 15230001301,
618516510707, 17143934806, 18572511510, 28248800006, 13410057536,
611969900002, 11820041047, 13410057340, 18769241103, 17199000025,
21820000803, 14182700712), UPC = c("00-01-82000-72608", "00-01-01820-00198",
"07-01-93271-20600", "00-01-86243-11167", "27-01-15503-01673",
"00-01-78250-00236", "00-01-18200-96664", "00-01-99232-00055",
"27-01-15504-02696", "00-01-34100-15306", "00-01-95301-13152",
"00-03-22405-55555", "00-02-76398-50000", "00-01-82000-72750",
"00-01-88573-50001", "00-01-72890-00122", "00-02-18200-00406",
"00-03-18200-01168", "00-01-83820-73500", "27-01-00007-13019",
"00-01-82548-20702", "07-01-19283-60388", "07-01-14834-00121",
"07-01-02596-10001", "27-01-15503-03400", "06-01-30279-33706",
"27-01-04200-18188", "07-01-80369-61992", "00-01-82153-19309",
"00-01-34100-59205", "27-01-15504-00652", "00-02-18200-86030",
"07-01-08820-49939", "27-01-04203-24019", "27-01-15503-00627",
"27-01-15504-01210", "27-01-15504-01305", "00-02-18200-00006",
"00-01-85498-66662", "00-01-50939-12201", "00-01-95374-70000",
"00-01-81986-00187", "00-01-72620-20101", "07-02-19256-00043",
"00-01-68213-05117", "07-01-86243-01106", "00-03-21242-00004",
"27-01-15503-00969", "00-01-62067-38055", "07-01-36040-00657",
"00-01-78250-00388", "00-01-62067-04971", "00-01-18200-15047",
"27-01-00001-60539", "00-01-70378-54652", "00-01-52300-00005",
"27-01-15504-01620", "00-01-70310-35738", "07-03-36920-11141",
"27-01-15503-00049", "00-01-80480-90593", "00-01-18200-27168",
"00-01-34100-14574", "07-01-92098-07234", "27-01-00007-05978",
"06-01-05748-02000", "00-01-34100-92528", "06-01-43775-10306",
"27-01-15503-01146", "27-01-00007-06142", "07-03-02770-00802",
"00-04-62067-00900", "27-01-15502-01229", "00-01-73832-06340",
"00-01-18200-00180", "07-01-94028-20030", "27-01-15504-02559",
"07-01-90586-00122", "00-01-92212-11102", "07-01-80369-91993",
"00-01-18200-18500", "00-01-18200-00997", "00-01-18200-41372",
"00-01-34100-50505", "00-01-92212-12111", "27-01-15503-00078",
"27-01-15503-07766", "00-01-80069-00221", "27-01-15504-03101",
"00-02-77940-00002", "00-01-02770-01302", "07-01-27658-12302",
"00-01-34100-62152", "00-01-72620-20128", "00-01-71990-77952",
"00-01-85725-21004", "07-01-08820-58624", "07-02-88946-10141",
"00-01-81986-00144", "00-01-18200-00242")), class = "data.frame", row.names = c(19638L,
11783L, 17140L, 19624L, 18706L, 8716L, 19653L, 5259L, 8738L,
13844L, 15570L, 11543L, 3919L, 1345L, 11412L, 3122L, 16899L,
19583L, 17617L, 16753L, 15011L, 17582L, 9364L, 16889L, 1310L,
5602L, 13696L, 8276L, 19528L, 18189L, 15840L, 19217L, 15982L,
13285L, 13568L, 5737L, 13263L, 19052L, 18282L, 13613L, 17300L,
16882L, 18785L, 9214L, 14925L, 2868L, 1009L, 17053L, 11674L,
7364L, 17077L, 16752L, 15990L, 17038L, 13449L, 19530L, 15166L,
19167L, 2539L, 542L, 9054L, 5597L, 5757L, 16754L, 17682L, 15167L,
18203L, 14441L, 657L, 13551L, 18748L, 747L, 14953L, 9903L, 4182L,
1330L, 11094L, 15936L, 17768L, 19642L, 15173L, 14921L, 17090L,
5759L, 12765L, 19579L, 15588L, 12885L, 19581L, 13814L, 17054L,
18697L, 11085L, 19136L, 2764L, 10523L, 17465L, 13960L, 17724L,
11687L))




1 Answer
1



I think you can do this with two gsub statements, based on nchar. I'm making the assumption here that for an 11 digit value in COLUPC, you are always missing a zero on the fourth group. And I called your data frame df1 in my sapply call. If you named yours something different you'll need to swap in your name.


gsub


nchar


COLUPC


df1


sapply



We'll define a helper function to pass to sapply:


sapply


fixUPC <- function(x)
if (nchar(x) == 11)
# we catch each Component in a separate group, then reconstruct in
# our substitution adding zeros and dashes as necessary
gsub("^(\d)(\d)(\d5)(\d4)","0\1-0\2-\3-0\4",x)
else if (nchar(x) == 12)
gsub("^(\d)(\d)(\d5)(\d5)","0\1-0\2-\3-\4",x)



df1$fixedUPC <- unlist(sapply(df1$COLUPC, fixUPC))

head(df1$fixedUPC)

[1] "07-01-56435-01208" "01-05-23000-00022" "01-08-76921-00012"
[4] "07-01-19256-00019" "02-08-38201-02393" "01-03-41002-05574"



You could also extend this to situations where you lost more than 3 characters, i.e. nchar == 10, however, as you lose more numbers, your ability to accurately fill in zeros will depend on how consistent the UPCs were to begin with. For your last example of 11820198, it becomes ambiguous whether the original UPC was 00-01-01820-00198 or 01-01-82019-00008. If you can say with certainty that the zeros always go one place or another for a certain nchar value, you're good to go, but that may not be true in your full data set.


nchar == 10


11820198


00-01-01820-00198


01-01-82019-00008


nchar





Thanks you this does work however it stores the UPC as a list: structure ` $ COLUPC : num 1.09e+10 1.09e+10 1.09e+10 1.09e+10 1.09e+10 ... $ UPC :List of 20015 ..$ : chr "01-00-94390-00008"`
– user113156
10 mins ago





I will add a small reproducible example
– user113156
10 mins ago





Just added unlist() should be okay now
– Mako212
54 secs ago


unlist()






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard