Adding back trailing zeros in an identification column
Clash Royale CLAN TAG#URR8PPP
Adding back trailing zeros in an identification column
I have the following data which looks like the following;
COLUPC UPC
19638 715643501208 00-01-82000-72608
11783 15230000022 00-01-01820-00198
17140 18769210012 07-01-93271-20600
19624 711925600019 00-01-86243-11167
18706 28382012393 27-01-15503-01673
8716 13410025574 00-01-78250-00236
The columns COLUPC
and UPC
come from two different data frames and a random sample of each was taken. The COLUPC
and UPC
are the same unique identifiers. It is just that COLUPC
has lost its format.
COLUPC
UPC
COLUPC
UPC
COLUPC
Taking the first row value for COLUPC
- 715643501208
. It should look like 07-01-56435-01208
following the same structure as in column UPC
.
COLUPC
715643501208
07-01-56435-01208
UPC
It gets a little more tricky with observation 15230000022
It should look like 01-05-23000-00022
where there was an extra 0
added in the 4th component 00022
along with component 1 and 2.
15230000022
01-05-23000-00022
0
00022
The documentation that comes along with the data file states that COLUPC
has no leading zeros shown.
COLUPC
So;
Component 1 is 2 digits
Component 2 is 2 digits (1 digit with a 0 at the begining)
Component 3 is 5 digits
Component 4 is 5 digits
Taking this number 00-01-01820-00198
from the UPC
column. In the COLUPC
column this would collapse to 11820198
where all the leading zeros have been removed from component 1, 2, 3 and 4.
00-01-01820-00198
UPC
COLUPC
11820198
If I am a little unclear let me know.
Data:
structure(list(COLUPC = c(715643501208, 15230000022, 18769210012,
711925600019, 28382012393, 13410025574, 716708427455, 12210000157,
13410025602, 17143945712, 17336011341, 13410062505, 11820053218,
11820002989, 13410059505, 11820043550, 18417331130, 711820053025,
18982650001, 18248813000, 17199077603, 18834510005, 13410048602,
18417310545, 11820000987, 13410000217, 17089701331, 13410017256,
618516510302, 23410030602, 17336080301, 613498630003, 17825000257,
16206705155, 17031000013, 13410000555, 16206704972, 37031030703,
27336082301, 17031023882, 18769227102, 18382056793, 31820000697,
13410048574, 17199048004, 11820041550, 11820000884, 18572511509,
14182700012, 13410010505, 18700050898, 18248812910, 18015761147,
18549873837, 16821309117, 618516510402, 17231163009, 611969900006,
11820011349, 11820000769, 13410046805, 13410000162, 13410000579,
18374110012, 21820000801, 17231163011, 23410068505, 17199030030,
11820000771, 17031000009, 28549873736, 11820000784, 17199048016,
13410057256, 11820054028, 11820001991, 13410057602, 17336080341,
21820029031, 715643501210, 17231163012, 17199047004, 18769201103,
13410000623, 15230000540, 618516510704, 17336011348, 15230001301,
618516510707, 17143934806, 18572511510, 28248800006, 13410057536,
611969900002, 11820041047, 13410057340, 18769241103, 17199000025,
21820000803, 14182700712), UPC = c("00-01-82000-72608", "00-01-01820-00198",
"07-01-93271-20600", "00-01-86243-11167", "27-01-15503-01673",
"00-01-78250-00236", "00-01-18200-96664", "00-01-99232-00055",
"27-01-15504-02696", "00-01-34100-15306", "00-01-95301-13152",
"00-03-22405-55555", "00-02-76398-50000", "00-01-82000-72750",
"00-01-88573-50001", "00-01-72890-00122", "00-02-18200-00406",
"00-03-18200-01168", "00-01-83820-73500", "27-01-00007-13019",
"00-01-82548-20702", "07-01-19283-60388", "07-01-14834-00121",
"07-01-02596-10001", "27-01-15503-03400", "06-01-30279-33706",
"27-01-04200-18188", "07-01-80369-61992", "00-01-82153-19309",
"00-01-34100-59205", "27-01-15504-00652", "00-02-18200-86030",
"07-01-08820-49939", "27-01-04203-24019", "27-01-15503-00627",
"27-01-15504-01210", "27-01-15504-01305", "00-02-18200-00006",
"00-01-85498-66662", "00-01-50939-12201", "00-01-95374-70000",
"00-01-81986-00187", "00-01-72620-20101", "07-02-19256-00043",
"00-01-68213-05117", "07-01-86243-01106", "00-03-21242-00004",
"27-01-15503-00969", "00-01-62067-38055", "07-01-36040-00657",
"00-01-78250-00388", "00-01-62067-04971", "00-01-18200-15047",
"27-01-00001-60539", "00-01-70378-54652", "00-01-52300-00005",
"27-01-15504-01620", "00-01-70310-35738", "07-03-36920-11141",
"27-01-15503-00049", "00-01-80480-90593", "00-01-18200-27168",
"00-01-34100-14574", "07-01-92098-07234", "27-01-00007-05978",
"06-01-05748-02000", "00-01-34100-92528", "06-01-43775-10306",
"27-01-15503-01146", "27-01-00007-06142", "07-03-02770-00802",
"00-04-62067-00900", "27-01-15502-01229", "00-01-73832-06340",
"00-01-18200-00180", "07-01-94028-20030", "27-01-15504-02559",
"07-01-90586-00122", "00-01-92212-11102", "07-01-80369-91993",
"00-01-18200-18500", "00-01-18200-00997", "00-01-18200-41372",
"00-01-34100-50505", "00-01-92212-12111", "27-01-15503-00078",
"27-01-15503-07766", "00-01-80069-00221", "27-01-15504-03101",
"00-02-77940-00002", "00-01-02770-01302", "07-01-27658-12302",
"00-01-34100-62152", "00-01-72620-20128", "00-01-71990-77952",
"00-01-85725-21004", "07-01-08820-58624", "07-02-88946-10141",
"00-01-81986-00144", "00-01-18200-00242")), class = "data.frame", row.names = c(19638L,
11783L, 17140L, 19624L, 18706L, 8716L, 19653L, 5259L, 8738L,
13844L, 15570L, 11543L, 3919L, 1345L, 11412L, 3122L, 16899L,
19583L, 17617L, 16753L, 15011L, 17582L, 9364L, 16889L, 1310L,
5602L, 13696L, 8276L, 19528L, 18189L, 15840L, 19217L, 15982L,
13285L, 13568L, 5737L, 13263L, 19052L, 18282L, 13613L, 17300L,
16882L, 18785L, 9214L, 14925L, 2868L, 1009L, 17053L, 11674L,
7364L, 17077L, 16752L, 15990L, 17038L, 13449L, 19530L, 15166L,
19167L, 2539L, 542L, 9054L, 5597L, 5757L, 16754L, 17682L, 15167L,
18203L, 14441L, 657L, 13551L, 18748L, 747L, 14953L, 9903L, 4182L,
1330L, 11094L, 15936L, 17768L, 19642L, 15173L, 14921L, 17090L,
5759L, 12765L, 19579L, 15588L, 12885L, 19581L, 13814L, 17054L,
18697L, 11085L, 19136L, 2764L, 10523L, 17465L, 13960L, 17724L,
11687L))
1 Answer
1
I think you can do this with two gsub
statements, based on nchar
. I'm making the assumption here that for an 11 digit value in COLUPC
, you are always missing a zero on the fourth group. And I called your data frame df1
in my sapply
call. If you named yours something different you'll need to swap in your name.
gsub
nchar
COLUPC
df1
sapply
We'll define a helper function to pass to sapply
:
sapply
fixUPC <- function(x)
if (nchar(x) == 11)
# we catch each Component in a separate group, then reconstruct in
# our substitution adding zeros and dashes as necessary
gsub("^(\d)(\d)(\d5)(\d4)","0\1-0\2-\3-0\4",x)
else if (nchar(x) == 12)
gsub("^(\d)(\d)(\d5)(\d5)","0\1-0\2-\3-\4",x)
df1$fixedUPC <- unlist(sapply(df1$COLUPC, fixUPC))
head(df1$fixedUPC)
[1] "07-01-56435-01208" "01-05-23000-00022" "01-08-76921-00012"
[4] "07-01-19256-00019" "02-08-38201-02393" "01-03-41002-05574"
You could also extend this to situations where you lost more than 3 characters, i.e. nchar == 10
, however, as you lose more numbers, your ability to accurately fill in zeros will depend on how consistent the UPCs were to begin with. For your last example of 11820198
, it becomes ambiguous whether the original UPC was 00-01-01820-00198
or 01-01-82019-00008
. If you can say with certainty that the zeros always go one place or another for a certain nchar
value, you're good to go, but that may not be true in your full data set.
nchar == 10
11820198
00-01-01820-00198
01-01-82019-00008
nchar
I will add a small reproducible example
– user113156
10 mins ago
Just added
unlist()
should be okay now– Mako212
54 secs ago
unlist()
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Thanks you this does work however it stores the UPC as a list: structure ` $ COLUPC : num 1.09e+10 1.09e+10 1.09e+10 1.09e+10 1.09e+10 ... $ UPC :List of 20015 ..$ : chr "01-00-94390-00008"`
– user113156
10 mins ago